Building a pragmatic evaluation harness
7/12/2025
Start tiny, measure the right things, and keep evals next to the code so they run every time you ship.
Three levels of eval
- Unit checks — deterministic validations for parsers, extractors, and tool I/O (example sketched after this list).
- Task evals — small golden sets (50–200 cases) with pass/fail rubrics.
- Ops metrics — latency, cost, review rate, and business KPIs.
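For a concrete sense of the first level, here's a minimal sketch of a deterministic unit check. The schema and field names are illustrative assumptions, not a prescribed format: it just verifies that a hypothetical extractor's output parses as JSON and carries the expected fields.

```python
import json

REQUIRED_FIELDS = {"who", "what", "when", "summary"}  # hypothetical schema for this example


def check_extractor_output(raw: str) -> list[str]:
    """Deterministic unit check: output must parse as JSON and carry the required fields."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "when" in data and not isinstance(data["when"], str):
        errors.append("'when' should be a date string")
    return errors


# A passing case: valid JSON with every required field present.
assert check_extractor_output(
    '{"who": "ACME", "what": "renewal", "when": "2025-07-01", "summary": "..."}'
) == []
```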
Assembling the golden set
- Pull anonymized samples across seasons, segments, and edge cases.
- Write short rubrics (e.g., “summary includes who/what/when, 120–200 words”).
- Use two-rater agreement for subjective tasks, then freeze (agreement check sketched below).
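One way to measure that agreement before freezing, sketched below under the assumption that both raters labeled the same cases pass/fail: raw agreement plus Cohen's kappa (agreement corrected for chance). What counts as "good enough to freeze" is a judgment call, not something this snippet decides for you.

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement beyond chance for two raters' labels on the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


# Made-up labels for illustration only.
a = ["pass", "pass", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail"]
raw = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"raw agreement: {raw:.2f}, kappa: {cohen_kappa(a, b):.2f}")
```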
A simple harness shape
- tests/ # YAML/JSON cases
- runner.{py,ts} # loads case → calls pipeline → checks rules → logs deltas
- report/ # pass %, common fails, cost Δ, example diffs
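A minimal runner in that shape might look like the sketch below. Everything concrete here is an assumption made to keep the example self-contained: case files are YAML with `input` and `rules` keys, `my_pipeline.call_pipeline` stands in for whatever the harness is evaluating, and the rule check is a naive substring match.

```python
# runner.py
import json
import pathlib
import time

import yaml  # PyYAML; assumes cases live as tests/*.yaml

from my_pipeline import call_pipeline  # hypothetical entry point under evaluation


def run_cases(case_dir: str = "tests") -> dict:
    results = []
    for path in sorted(pathlib.Path(case_dir).glob("*.yaml")):
        case = yaml.safe_load(path.read_text())
        start = time.perf_counter()
        output = call_pipeline(case["input"])  # the system under test
        latency = time.perf_counter() - start
        # Naive rule check: each rule is a substring the output must contain.
        failures = [rule for rule in case.get("rules", []) if rule not in output]
        results.append({
            "case": path.name,
            "passed": not failures,
            "failures": failures,
            "latency_s": round(latency, 3),
        })

    report = {
        "pass_rate": sum(r["passed"] for r in results) / max(len(results), 1),
        "results": results,
    }

    # Log the delta against the previous run, if one exists.
    report_dir = pathlib.Path("report")
    report_dir.mkdir(exist_ok=True)
    latest = report_dir / "latest.json"
    if latest.exists():
        report["pass_rate_delta"] = round(
            report["pass_rate"] - json.loads(latest.read_text())["pass_rate"], 3
        )
    latest.write_text(json.dumps(report, indent=2))
    return report


if __name__ == "__main__":
    print(json.dumps(run_cases(), indent=2))
```

Keeping the runner this plain is deliberate: it should be boring enough that adding a case never feels like work.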
Track on every PR
- Structured-output validity (parse rate, missing fields)
- Hallucination rate (facts unsupported by retrieved sources)
- Cost & latency deltas
- Degradation vs. last “good” model + prompt snapshot
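As a sketch of how those numbers might gate a PR, the function below compares the current run against the last known-good snapshot. The metric names, tolerances, and values are all hypothetical placeholders for whatever your report actually emits.

```python
def compare_to_baseline(current: dict, baseline: dict, tolerances: dict) -> list[str]:
    """Flag regressions vs. the last known-good model + prompt snapshot."""
    # Direction per metric: True means higher is better (hypothetical metric names).
    higher_is_better = {
        "parse_rate": True,
        "hallucination_rate": False,
        "cost_usd": False,
        "p95_latency_s": False,
    }
    regressions = []
    for metric, better_high in higher_is_better.items():
        delta = current[metric] - baseline[metric]
        worsened = -delta if better_high else delta
        if worsened > tolerances.get(metric, 0.0):
            regressions.append(
                f"{metric}: {baseline[metric]} -> {current[metric]} "
                f"(allowed drift {tolerances.get(metric, 0.0)})"
            )
    return regressions


# Example CI gate with made-up numbers for illustration.
issues = compare_to_baseline(
    current={"parse_rate": 0.97, "hallucination_rate": 0.03, "cost_usd": 0.012, "p95_latency_s": 2.1},
    baseline={"parse_rate": 0.98, "hallucination_rate": 0.02, "cost_usd": 0.010, "p95_latency_s": 2.0},
    tolerances={"parse_rate": 0.02, "hallucination_rate": 0.01, "cost_usd": 0.005, "p95_latency_s": 0.5},
)
if issues:
    raise SystemExit("Regression vs. baseline:\n" + "\n".join(issues))
```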
A harness you use beats a perfect one you don’t. Keep it small, near the code, and relentlessly tied to the business outcome.