Building a pragmatic evaluation harness
7/12/2025
Start tiny, measure the right things, and keep evals next to the code so they run every time you ship.
Three levels of eval
- Unit checks — deterministic validations for parsers, extractors, and tool I/O (example sketched after this list).
- Task evals — small golden sets (50–200 cases) with pass/fail rubrics.
- Ops metrics — latency, cost, review rate, and business KPIs.
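For a concrete sense of the first level, here's a minimal sketch of a deterministic unit check. The schema and field names are illustrative assumptions, not a prescribed format: it just verifies that a hypothetical extractor's output parses as JSON and carries the expected fields.

```python
import json

REQUIRED_FIELDS = {"who", "what", "when", "summary"}  # hypothetical schema for this example


def check_extractor_output(raw: str) -> list[str]:
    """Deterministic unit check: output must parse as JSON and carry the required fields."""
    errors = []
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "when" in data and not isinstance(data["when"], str):
        errors.append("'when' should be a date string")
    return errors


# A passing case: valid JSON with every required field present.
assert check_extractor_output(
    '{"who": "ACME", "what": "renewal", "when": "2025-07-01", "summary": "..."}'
) == []
```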
Assembling the golden set
- Pull anonymized samples across seasons, segments, and edge cases.
- Write short rubrics (e.g., “summary includes who/what/when, 120–200 words”).
- Use two-rater agreement for subjective tasks, then freeze (agreement check sketched below).
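One way to measure that agreement before freezing, sketched below under the assumption that both raters labeled the same cases pass/fail: raw agreement plus Cohen's kappa (agreement corrected for chance). What counts as "good enough to freeze" is a judgment call, not something this snippet decides for you.

```python
from collections import Counter


def cohen_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Agreement beyond chance for two raters' labels on the same cases."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a | counts_b) / (n * n)
    return (observed - expected) / (1 - expected) if expected < 1 else 1.0


# Made-up labels for illustration only.
a = ["pass", "pass", "fail", "pass", "fail"]
b = ["pass", "fail", "fail", "pass", "fail"]
raw = sum(x == y for x, y in zip(a, b)) / len(a)
print(f"raw agreement: {raw:.2f}, kappa: {cohen_kappa(a, b):.2f}")
```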
A simple harness shape
- tests/ # YAML/JSON cases
- runner.{py,ts} # loads case → calls pipeline → checks rules → logs deltas
- report/ # pass %, common fails, cost Δ, example diffs
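A minimal runner in that shape might look like the sketch below. Everything concrete here is an assumption made to keep the example self-contained: case files are YAML with `input` and `rules` keys, `my_pipeline.call_pipeline` stands in for whatever the harness is evaluating, and the rule check is a naive substring match.

```python
# runner.py
import json
import pathlib
import time

import yaml  # PyYAML; assumes cases live as tests/*.yaml

from my_pipeline import call_pipeline  # hypothetical entry point under evaluation


def run_cases(case_dir: str = "tests") -> dict:
    results = []
    for path in sorted(pathlib.Path(case_dir).glob("*.yaml")):
        case = yaml.safe_load(path.read_text())
        start = time.perf_counter()
        output = call_pipeline(case["input"])  # the system under test
        latency = time.perf_counter() - start
        # Naive rule check: each rule is a substring the output must contain.
        failures = [rule for rule in case.get("rules", []) if rule not in output]
        results.append({
            "case": path.name,
            "passed": not failures,
            "failures": failures,
            "latency_s": round(latency, 3),
        })

    report = {
        "pass_rate": sum(r["passed"] for r in results) / max(len(results), 1),
        "results": results,
    }

    # Log the delta against the previous run, if one exists.
    report_dir = pathlib.Path("report")
    report_dir.mkdir(exist_ok=True)
    latest = report_dir / "latest.json"
    if latest.exists():
        report["pass_rate_delta"] = round(
            report["pass_rate"] - json.loads(latest.read_text())["pass_rate"], 3
        )
    latest.write_text(json.dumps(report, indent=2))
    return report


if __name__ == "__main__":
    print(json.dumps(run_cases(), indent=2))
```

Keeping the runner this plain is deliberate: it should be boring enough that adding a case never feels like work.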
Track on every PR
- Structured-output validity (parse rate, missing fields)
- Hallucination rate (facts unsupported by retrieved sources)
- Cost & latency deltas
- Degradation vs. last “good” model + prompt snapshot
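As a sketch of how those numbers might gate a PR, the function below compares the current run against the last known-good snapshot. The metric names, tolerances, and values are all hypothetical placeholders for whatever your report actually emits.

```python
def compare_to_baseline(current: dict, baseline: dict, tolerances: dict) -> list[str]:
    """Flag regressions vs. the last known-good model + prompt snapshot."""
    # Direction per metric: True means higher is better (hypothetical metric names).
    higher_is_better = {
        "parse_rate": True,
        "hallucination_rate": False,
        "cost_usd": False,
        "p95_latency_s": False,
    }
    regressions = []
    for metric, better_high in higher_is_better.items():
        delta = current[metric] - baseline[metric]
        worsened = -delta if better_high else delta
        if worsened > tolerances.get(metric, 0.0):
            regressions.append(
                f"{metric}: {baseline[metric]} -> {current[metric]} "
                f"(allowed drift {tolerances.get(metric, 0.0)})"
            )
    return regressions


# Example CI gate with made-up numbers for illustration.
issues = compare_to_baseline(
    current={"parse_rate": 0.97, "hallucination_rate": 0.03, "cost_usd": 0.012, "p95_latency_s": 2.1},
    baseline={"parse_rate": 0.98, "hallucination_rate": 0.02, "cost_usd": 0.010, "p95_latency_s": 2.0},
    tolerances={"parse_rate": 0.02, "hallucination_rate": 0.01, "cost_usd": 0.005, "p95_latency_s": 0.5},
)
if issues:
    raise SystemExit("Regression vs. baseline:\n" + "\n".join(issues))
```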
A harness you use beats a perfect one you don’t. Keep it small, near the code, and relentlessly tied to the business outcome.