date: 2026-06-04
tags: [evals, ai-quality, harness, find-harness-improvements]
status: active
graduated_to:

Pairwise LLM-as-judge as a gate — the subjective twin of harness:check

A forward learning, not a footgun. Recorded as the seed that find-harness-improvements step 2 ("mine our own learnings first") will surface in a future audit.

Pattern — to gate subjective output quality (where there's no single correct answer, so a deterministic assertion can't decide it), don't score one output against an absolute rubric. Instead generate two versions, hand both to an LLM judge, and ask which is better. Pairwise comparison needs no labelled ground truth and is far more stable than absolute scoring — the judge only has to pick a winner, not invent a number. Run it like a test: a regression fails when the new version loses to the baseline.
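As a rough illustration of the pattern, here is a minimal Python sketch of a pairwise gate. It is not Tempo's implementation, which is not yet built. The names `pairwise_gate`, `GateResult`, and `stub_judge` are hypothetical, the judge is a stand-in function rather than a real LLM call, and the pass rule (the candidate must not lose more often than it wins) is an assumption chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical judge signature: (task, output_a, output_b) -> "A" or "B".
Judge = Callable[[str, str, str], str]


@dataclass
class GateResult:
    wins: int    # cases where the judge preferred the candidate
    losses: int  # cases where the judge preferred the baseline

    @property
    def passed(self) -> bool:
        # Assumed rule: the candidate must not lose more often than it wins.
        return self.wins >= self.losses


def pairwise_gate(
    tasks: Sequence[str],
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    judge: Judge,
) -> GateResult:
    """Generate both versions for each task and tally the judge's picks."""
    wins = losses = 0
    for task in tasks:
        # The baseline is always shown as "A" and the candidate as "B".
        verdict = judge(task, baseline(task), candidate(task))
        if verdict == "B":
            wins += 1
        elif verdict == "A":
            losses += 1
        else:
            raise ValueError(f"judge returned {verdict!r}, expected 'A' or 'B'")
    return GateResult(wins, losses)


def stub_judge(task: str, output_a: str, output_b: str) -> str:
    # Stand-in for the LLM call so the sketch runs offline:
    # it simply prefers the longer output, and the baseline on a tie.
    return "B" if len(output_b) > len(output_a) else "A"


# Example run with made-up tasks and generators.
tasks = ["nudge for a missed workout", "nudge for a 3-day streak"]
result = pairwise_gate(
    tasks,
    baseline=lambda t: f"Reminder: {t}",
    candidate=lambda t: f"Reminder: {t}. You've got this!",
    judge=stub_judge,
)
assert result.passed  # the gate "test" is green when the candidate does not regress
```

The judge only ever returns a winner, never a score, which is the point of the pattern. In a real setup the stub would be replaced by a call to an LLM, and the fixed A/B ordering used here is a simplification.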

Why it matters — Tempo already has a deterministic harness gate (php artisan harness:check) and a deterministic code gate (composer ci:check). Neither can speak to the thing the product actually lives or dies on — is the nudge motivating, is a Blade prompt producing a good draft? That's a subjective axis with no deterministic oracle. Pairwise-judge-as-a-gate is the subjective twin of harness:check: the missing half of "everything that ships is gated." A session-long find-harness-improvements run (mining the LangChain / LangSmith eval material) flagged this as the single biggest open eval gap.

Where it plugs in

  • Harness side — a candidate qa-harness lens (judgment-based, so a lens, not a mechanically-decidable harness:check rule) — the subjective counterpart to the structural invariants harness:check already enforces.
  • Product side — the Tempo-app implementation (eval loop over resources/views/prompts/ + the daily nudge, gated like a test) is tracked as its own product issue, #137, under epic #136.

Status: seed. Vetted ADOPT in the LangChain study (#130), not yet built. Provenance: LangSmith evaluation docs, via the find-harness-improvements run captured in #130. Picked up here so the decision isn't re-litigated and a future audit knows it has already been judged worth doing.