Skip to main content

Epistemic Validation

This is the heart of Waypoint. Most tools answer the question "did it pass?" with a single bit — exit code 0. Waypoint answers a harder, more useful question: "what holds, why does it hold, and what kind of evidence does that rest on?" It does this by emitting, on every run, a signed proof certificate that you (or a CI gate, or a third party) can verify without re-running anything.

A passing run produces one bit of information: exit code 0. This is the architecture of a test runner, not a proof system.

That line, from the architecture decision that reframed Waypoint around proofs, is the whole motivation in one sentence. The rest of this page is what replaced it. For the command and flag specifics, see the CLI Reference; for the motivation in full, see Why Waypoint.

What waypoint validate proves

Running waypoint validate establishes five things about your artifact set:

  1. Schema conformance — every JSON artifact validates against its schema.
  2. Cross-reference integrity — every slug reference resolves to an artifact that exists.
  3. Orphan detection — no unreferenced personas, pain points, user goals, or product goals.
  4. Structural integrity — features have flows, flow steps are sequential, demo screenshots exist on disk, declared page routes match the app's router, declared CLI commands match the binary.
  5. Requirement coverage — application capabilities are captured as EARS requirements.

A passing run means, in the CLI's own words, "all artifacts pass schema validation, all cross-references resolve, and the artifacts directory is cohesive per the embedded validation rules." But the interesting part isn't the verdict — it's the artifact the verdict comes packaged in.

Checks versus proofs: the reframe

A check produces a verdict. A proof produces a certificate — a structured, machine-checkable artifact that encodes why the verdict holds: which inputs were examined, which rules were applied, and which prior results each rule depended on. A certificate can be verified by a third party without re-running anything.

That difference cashes out in four ways:

  • Certificates. The output is auditable, not opaque. You can trace any verdict to its evidence.
  • Provenance. Each node records the exact content hashes of the artifacts it read — which is what makes a verdict invalidatable when those inputs change.
  • Obligation generation. A proof system can be inverted: given partial artifacts and the constraint set, it can enumerate what must be created for a rule to pass — turning a failure into a to-do list.
  • Incremental re-checking. Because the tree records which results depend on which artifact hashes, a single edit re-evaluates only the affected nodes.

The proof tree

The certificate is a tree. The root is a single claim about the whole project; its descendants decompose that claim down to mechanical evidence at the leaves. The tree distinguishes two kinds of node:

  • Conclusion nodes (the non-leaves) are authored catalog entries that make a claim. Each carries human-readable what / why / prevents text. A conclusion is atomic if its children are raw evidence, or composite if its children are other conclusions.
  • Primitive nodes (the leaves) carry one piece of evidence from one source — a file:line citation, a test count plus seed and counterexample, an LLM rationale. Primitives carry evidence, never claims.

The root claim is project.is_validated:

what: The project's declared intent is realized by an architecture whose code conforms to its artifacts and whose behavior matches its declared invariants. why: A project may compile and run while silently breaking what it was specified to do. prevents: Production bugs caused by code drifting from spec without anyone noticing.

This two-layer structure lets the same tree serve three audiences from three altitudes. A non-technical reader stops at the composite conclusions and reads business-language claims. An engineer drills into the atomic conclusions for the technical claim. A reviewer drills all the way into the primitives to inspect the mechanical evidence.

A worked example, rendered as a tree, makes the shape concrete:

project.is_validated ✓ [S,R,Q]
├─ architecture.is_sound ✓ [S,R] (section, composite)
│ └─ booking.is_correct (aggregate) ✓ [S,R] (per-artifact composite)
│ ├─ booking.no_mutation_on_reject ✓ [S] (atomic conclusion)
│ │ ├─ source.codespec ✓ (booking.rs:42) ← primitive leaf
│ │ └─ source.proptest ✓ (1000 cases, seed 0x4a7b1c, 0 shrinks)
│ └─ booking.invariants_hold ✓ [S,R]
│ └─ source.unit_test ✓ (12 passed)
├─ requirements.are_complete ✓ [S]
│ └─ ...
└─ features.are_implemented ✓ [S,R,Q]
└─ checkout.flow_is_coherent ✓ [Q]
└─ source.llm ✓ ("flow tells a complete, plausible story")

Notice the tags in brackets — that is the second half of the model.

S / R / Q: what kind of evidence is this?

Every conclusion carries a reasoning tag set, drawn from three letters:

  • S — statically derivable. Fed by primitive sources schema, prolog, codespec, and tooling. The strongest kind of evidence: derivable from the schema, the rules over the artifact graph, the code-to-artifact match, and language-level guarantees, without running anything.
  • R — runtime evidence. Fed by unit_test and proptest. The claim is evidenced by tests that actually ran, not proven from structure.
  • Q — qualitative review. Fed by llm. The claim rests on an LLM-as-judge reading the artifacts and forming an opinion.

There are seven primitive proof-source kinds feeding these tags: schema (the JSON Schema validator), prolog (rules over the artifact graph), codespec (the "code matches the artifact" check — produced by the code generator on day one and an artifact-aware linter plugin on day N), tooling (language-guarantee facts), unit_test, proptest (property tests with shrinking), and llm.

Children infect parents

The crucial property: a composite's tag set is the union of its children's tags. Tags propagate up the tree, so the root's tags expose the entire evidence surface of the verdict at a glance:

  • project.is_validated ✓ [S] means the entire verdict is pure static proof.
  • project.is_validated ✓ [S, R, Q] means some branch somewhere rests on runtime evidence and some branch somewhere rests on LLM judgment.

Tags are diagnostic, not categorical — they don't grade the verdict, they tell you what kind of knowledge it rests on. A green [S] root is a far stronger green than a green [S, R, Q] root, and the system refuses to hide that distinction.

:::warning The Q tier is judgment, not proof A conclusion tagged Q rests on an LLM review. That review can be wrong. Waypoint's discipline is to make that explicit — to surface the Q tag honestly so a human can weigh it — not to launder a judgment into something that looks like a proof. Treat a Q-only branch as an informed opinion, not a guarantee. :::

Justification discipline

Reinforcing the honesty: any atomic conclusion whose reasoning is not exactly {S} must carry a written justification explaining why it isn't purely statically derivable — a decidability gap, a language-guarantee gap, or a soundness gap. The discipline is to push back: a justification that reads "easier to test" is wrong, and the right response is to ask whether more could be declared in the schema to shift the claim up to a static source. The system is biased toward upgrading evidence into proof over time, rather than settling for the weakest source that happens to work.

Memoization: only recompute what changed

Validation memoizes every primitive verdict and replays it on the next run, so a single edit doesn't re-prove the whole project. The cache lives in the committed waypoint.proof.json and each verdict is keyed by:

ruleID + clauseHash + InputHashes + toolchainKey
  • ruleID — which rule.
  • clauseHash — the rule's own logic; change the rule body and its cached verdicts invalidate.
  • InputHashes — the content hashes of every artifact (and code file) the rule read. Touch one of those files and only the rules that read it recompute.
  • toolchainKey — the relevant tool versions.

Q-tier (LLM) rules additionally fold in the SHA-256 of the effective reviewAgent configuration from waypoint.json. So bumping the LLM model invalidates every cached Q-tier verdict, while bumping a single artifact still only invalidates the rules whose declared inputs mention that file.

The net effect: re-validating after one small edit is near-instant in the common case, which is what makes it practical to validate after every step rather than only at the end. The --cache-stats flag reports hits, misses, and per-rule recompute times — see the CLI Reference for the exact output and flags.

Validate is read-only

One property is load-bearing for the agentic story: waypoint validate does not write source code. Authoring schema-validates as you go; a separate, explicit step generates code; and validation only proves conclusions — it never overwrites a file. Generation-as-a-side-effect-of-validation would be a footgun: an agent fills in a stub body, validation overwrites it, work is lost. Day-N drift between code and artifacts is caught by checking, not by overwriting. That is what makes the validator a safe oracle: an agent can call it as often as it likes and its ground truth never mutates the work.

The certificate, signed and verifiable

Every run — pass or fail — produces a certificate with the same structure. A failed project gets a certificate that records what failed and why, in exactly the form a passing one would. The full certificate is content-addressed and stored out of the repo; the committed waypoint.proof.json is a lightweight index — {certificateHash, commitSHA, generatedAt, summary} — that gives code review a clear signal that a commit's proof state changed.

The certificate is signed, and the signing key plus the certificate schema are public, even though the rule implementations, the artifact content (hashes only), and the LLM prompts are not. This is the TLS model: a certificate authority's domain-validation process is private, but the certificate and its verification are public. Anyone can confirm the signature, confirm the hash matches the nodes, and confirm specific rules passed against specific artifact hashes. (The CLI's verify path checks exactly this — structural validity is independent of pass/fail.)

Honest limits

Waypoint is precise about what it does and does not prove. Keep these caveats in mind:

  • Most shipped nodes are instance proofs, not universal proofs. An instance proof says "this model was checked." A universal proof says "no valid model can violate this rule" — and those are solver-backed and largely still ahead of the shipped system. Don't read a green root as "universally proven everywhere"; read it as "this project, at this revision, checked out — and here is the evidence surface."

  • The Q (LLM) tier is judgment, not proof. As above: an LLM review is an opinion the system is honest about, not a guarantee.

  • Incremental re-evaluation is partial. The memoization framework is in place and the common single-edit case is fast, but full hot-path wiring of "re-evaluate only stale nodes" everywhere is still being completed. Treat fast incremental re-validation as the design intent that is largely — not yet universally — realized.

  • Runtime tiers depend on tests existing. R-tagged conclusions are only as strong as the unit_test / proptest evidence behind them; a claim with no tests to run can't be evidenced at runtime.

These limits are the point, not an embarrassment. The reasoning tags and justification discipline exist precisely so the certificate never overstates itself: it tells you, honestly, whether a verdict rests on proof, on evidence, or on judgment — and lets you calibrate your confidence accordingly.

Where to go next