
Why Waypoint

A passing test suite tells you almost nothing.

That sounds wrong, so let's be precise. When your test runner exits cleanly, it hands back exactly one bit of information: exit code 0. It does not tell you what held, why it held, which inputs were examined, or which rules were applied. You cannot hand that bit to a third party and have them confirm anything. You cannot invert it into "here is what's still missing." You cannot reuse it incrementally when one file changes. That is the architecture of a test runner, and a test runner is not a proof system.


For a human glancing at CI, one bit is often enough — you wrote the tests, you trust your judgment, you move on. But the moment you hand the keys to an autonomous agent running for hours without supervision, one opaque bit is nowhere near enough to build trust on. This page is about what you need instead, and why getting it right is what lets agents run longer and more confidently.

From checks to proofs

Waypoint replaces the check with a proof.

A check produces a verdict. A proof produces a certificate — a structured, machine-checkable artifact that encodes why the verdict holds: which inputs were examined, which rules were applied, and which prior results each rule depended on. A certificate can be verified by a third party without re-running anything.

That shift buys four things a plain check can't offer:

  • Certificates. Every waypoint validate run emits waypoint.proof.json — a record of the derivation, not just its conclusion. A failed project gets a certificate too, with the same structure as a passing one, recording exactly what failed and why.
  • Obligation generation. A proof system can run backwards. Given the artifacts you have and the constraints you must satisfy, it can enumerate what must still exist to satisfy them — turning a failure into a concrete next task.
  • Incremental re-checking. Each node in the proof records the content hashes of the artifacts it read. Change one artifact, and only the nodes that cited it need to recompute.
  • Attributable evidence. Every conclusion is tagged with what kind of knowledge it rests on (more on this below), so "proven" and "merely plausible" never get confused.
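To make the incremental re-checking idea concrete, here is a minimal sketch of how recorded content hashes can identify which nodes need to recompute after an edit. The `ProofNode` class, its field names, the claim names, and the choice of SHA-256 are all assumptions made for illustration; this is not Waypoint's actual certificate schema.

```python
import hashlib
from dataclasses import dataclass


def content_hash(data: bytes) -> str:
    """Hash artifact contents (SHA-256 is an assumption for this sketch)."""
    return hashlib.sha256(data).hexdigest()


@dataclass(frozen=True)
class ProofNode:
    """Hypothetical proof node: a claim plus the hashes of the artifacts it read."""
    claim: str
    read_hashes: dict[str, str]  # artifact path -> content hash at proof time


def stale_nodes(nodes: list[ProofNode], current: dict[str, bytes]) -> list[str]:
    """Return the claims whose recorded input hashes no longer match."""
    stale = []
    for node in nodes:
        for path, recorded in node.read_hashes.items():
            data = current.get(path)
            if data is None or content_hash(data) != recorded:
                stale.append(node.claim)
                break  # one mismatched input is enough to mark the node stale
    return stale


# Illustrative artifacts and claims (names are made up).
artifacts = {"intent.md": b"users can log in", "api.yaml": b"POST /login"}
nodes = [
    ProofNode("intent.is_covered", {"intent.md": content_hash(artifacts["intent.md"])}),
    ProofNode("api.matches_spec", {"api.yaml": content_hash(artifacts["api.yaml"])}),
]

artifacts["api.yaml"] = b"POST /login\nPOST /logout"  # edit one artifact
print(stale_nodes(nodes, artifacts))  # ['api.matches_spec']
```

Only the node that cited the edited artifact is reported; the node that read `intent.md` keeps its earlier result.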

What "epistemic validation" means here

Waypoint's validation is epistemic in a specific sense: every verdict is justified, attributable, and replayable.

  • Justified. Each conclusion node states a claim in plain language — what it asserts, why it matters, and what it prevents — and bottoms out in concrete evidence: a file and line, a test count and seed, a counterexample, an LLM rationale.

  • Attributable. Each conclusion carries a set of reasoning tags saying exactly what kind of evidence underpins it:

    • S — statically derivable. Proven from schema, rules over the artifact graph, code-to-spec checks, or a language guarantee. This is the strongest tag: a universal claim, not a sample.
    • R — runtime evidence. Backed by a unit test or a property test (which actually ran).
    • Q — qualitative review. An LLM-as-judge verdict. This is judgment, not proof.

    Tags propagate upward: a parent inherits the tags of all its children, so the project root exposes the full evidence surface at a glance. A green root tagged [S] means the entire verdict is pure static proof; a green root tagged [S, R, Q] means some branch rests on a test that ran and some branch rests on an LLM judgment. The tags are diagnostic, not decorative: they tell you precisely how much to trust the green.

  • Replayable. The certificate is signed and pinned to content hashes. Anyone can re-confirm the verdict against those hashes without re-running, and confirm the certificate wasn't tampered with — the same model as a TLS certificate, where the issuing process is private but verification is public.
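The upward propagation of reasoning tags described above amounts to a set union over the proof tree. The following is a minimal sketch under that reading; the `Node` class and every claim name other than `project.is_validated` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical conclusion node, for illustration only."""
    claim: str
    own_tags: frozenset[str] = frozenset()
    children: list["Node"] = field(default_factory=list)


def effective_tags(node: Node) -> frozenset[str]:
    """A node's tags are its own tags plus everything its descendants carry."""
    tags = set(node.own_tags)
    for child in node.children:
        tags |= effective_tags(child)
    return frozenset(tags)


root = Node("project.is_validated", frozenset({"S"}), [
    Node("schema.is_valid", frozenset({"S"})),
    Node("behavior.matches_invariants", frozenset({"R"})),
    Node("naming.is_clear", frozenset({"Q"})),
])
print(sorted(effective_tags(root)))  # ['Q', 'R', 'S']
```

A single Q-tagged leaf is enough to put Q on the root, which is why a root tagged only [S] is such a strong statement.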

This is the opposite of "exit code 0": a single opaque bit with no explanation, no provenance, and no way to audit it.

The central argument: epistemic validation unlocks longer, more confident agentic sessions

Here is the thesis, plainly: strict, epistemic validation of intent is what lets an agent run longer and more confidently without a human in the loop.

An autonomous coding agent's characteristic failure mode is drift. With each step it strays a little further from what was actually wanted, because it has no authoritative, machine-checkable statement of "correct" to anchor against. Asking it to vibe-check its own output — "does this look right?" — compounds the error. It's the model grading its own homework with no answer key.

Waypoint hands the agent the answer key as a first-class, external artifact. Six properties make that answer key usable inside an agent's inner loop.

1. A provable ground truth

The artifact DAG is the spec, and the root claim project.is_validated is a single, decomposable, signed verdict on whether the code realizes that spec:

what: The project's declared intent is realized by an architecture whose code conforms to its artifacts and whose behavior matches its declared invariants.

why: A project may compile and run while silently breaking what it was specified to do.

The agent stops asking "does this look right?" and starts asking "does waypoint validate prove the root claim — and if not, exactly which node failed?" That target is crisp, attributable, and non-negotiable.

2. A feedback loop tight enough for the inner loop

Verdicts are memoized by input hash, and the reasoning engine is demand-driven — irrelevant rules never run. Re-validating after a single edit is near-instant in the common case, because only the rules whose declared inputs mention the touched file recompute. That speed is what lets the agent validate after every small step instead of at the end of a giant batch — catching drift while it's still one node wide rather than spread across a session's worth of work.

note

The caching is content-addressed. The memo key folds in the rule, its clause body, the hashes of the inputs it read, and the toolchain; LLM-tier verdicts additionally fold in the review-agent configuration. Touch one artifact and only the rules that read it recompute. See the CLI Reference for the cache mechanics.
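As a rough illustration of the content-addressed key described in the note, the sketch below folds the listed ingredients into a single digest. The function name, parameter names, JSON encoding, and SHA-256 are assumptions for this example, not Waypoint's actual cache format.

```python
import hashlib
import json


def memo_key(rule_id, clause_body, input_hashes, toolchain, review_config=None):
    """Derive a cache key from everything that could change a rule's verdict."""
    payload = {
        "rule": rule_id,
        "clause": clause_body,
        "inputs": sorted(input_hashes.items()),  # order-independent
        "toolchain": toolchain,
    }
    if review_config is not None:
        # Only LLM-tier verdicts would fold in the review-agent configuration.
        payload["review"] = review_config
    encoded = json.dumps(payload, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


# Hypothetical rule and inputs.
base = memo_key("rule.a", "body", {"a.py": "h1"}, "toolchain-1")
same = memo_key("rule.a", "body", {"a.py": "h1"}, "toolchain-1")
edited = memo_key("rule.a", "body", {"a.py": "h2"}, "toolchain-1")
print(base == same, base == edited)  # True False
```

Because the key depends only on declared inputs, an edit to a file the rule never read leaves the key unchanged and the memoized verdict can be reused.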

3. A read-only oracle that never sabotages the work

waypoint validate writes no source code. Validation and code generation are separate stages by design — because generation-as-a-side-effect-of-validation is a footgun: the agent fills stub bodies, validation overwrites them, and work is lost. Day-N drift between code and artifacts is caught by checking, not by overwriting.


This is the safety property that makes high-frequency self-checking viable: the agent can lean on the validator as hard as it wants, as often as it wants, and the validator never mutates the work in progress. The oracle is idempotent and read-only.

4. Calibrated confidence

Because of the S/R/Q tags and the discipline that any non-static conclusion must justify why it isn't purely provable, the agent — and its human supervisor — can tell the difference between "proven" and "merely evidenced." A green [S] root is a far stronger green than [S, R, Q] with a Q branch resting on an LLM judgment. The agent can be appropriately confident: bold where the proof is universal, cautious where it rests on a sample or a judgment. Confidence becomes calibrated rather than uniform.

5. Self-direction via obligations

When a node fails, the proof system can invert the constraint into "what must exist for this to pass," handing the agent its next task without a human framing it. The loop becomes:

validate → read the failed node's obligations → author the named artifact
or fill the named stub → re-validate → repeat

The agent drives itself down the DAG, closing obligations one at a time, instead of waiting for a human to decompose the work.
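The loop above can be sketched in a few lines. This example does not call the real CLI: `validate` here is a stand-in that treats an obligation as simply "a required artifact is missing", and the artifact names are invented for illustration.

```python
def validate(required: set[str], present: set[str]) -> list[str]:
    """Stand-in validator: returns open obligations (missing artifacts), sorted."""
    return sorted(required - present)


def close_obligations(required, present, max_steps=10):
    """Validate, take the first obligation, satisfy it, and re-validate."""
    present = set(present)
    log = []
    for _ in range(max_steps):
        obligations = validate(required, present)
        if not obligations:
            return present, log  # nothing left to do: the root would pass
        next_task = obligations[0]
        present.add(next_task)  # stands in for authoring the artifact or filling the stub
        log.append(next_task)
    raise RuntimeError("obligations still open after max_steps")


required = {"intent.md", "api.yaml", "tests/login.py"}
final, log = close_obligations(required, {"intent.md"})
print(log)  # ['api.yaml', 'tests/login.py']
```

The step cap is a design choice for the sketch: an unattended loop should have a bounded budget so that an obligation the agent cannot satisfy surfaces as an error instead of spinning forever.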

6. Durable cross-session checkpoints

The signed certificate is a portable proof that this state was coherent at commit X. A new session — or a different agent entirely — resumes from a verified, warm baseline instead of re-deriving trust from scratch, and a human or CI gate can audit any checkpoint independently.

Net: the human moves up to intent — authoring and curating artifacts, approving the proof tree — and the agent is freed for long stretches of unsupervised execution, because the validator is a trustworthy, fast, read-only, self-explaining oracle for "am I still building the right thing, and can I prove it?" That's the division of labor Waypoint is built around: human designed, AI directed — AI that picks up what you put down.

Being honest about the limits

The strongest case is the honest one, so:

caution
  • The Q (LLM) tier is judgment, not proof. An LLM-as-judge verdict can be wrong. Waypoint's discipline is to make that explicit with a tag and a required justification — not to pretend the judgment is proof.
  • Most shipped conclusions are instance proofs plus strong static evidence, not universal proofs. "This model was checked" is the common case; "no valid model can violate this rule" (solver-backed, universal) is the aspiration the system pushes toward over time. Waypoint does not claim universal proofs everywhere.
  • Some incremental machinery is partial. The framework for recomputing only stale nodes is in place, but full hot-path wiring and solver-backed universal proofs are still being completed. The honest framing is "fast incremental re-checking for the common case," not "fully incremental everywhere."
  • Runtime tiers depend on tests existing. R-tier evidence is only as good as the tests that were emitted and actually ran.

None of these weaken the core argument. Even with the Q tier acknowledged as judgment and universal proofs treated as aspirational, a justified, attributable, replayable verdict over a structured spec is categorically more trustworthy than one opaque bit — and that's exactly what an agent needs to run longer without a human babysitting it.

Next steps