CALIBURN

Building a pragmatic evaluation harness

7/12/2025

Start tiny, measure the right things, and keep evals next to the code so they run every time you ship.

Three levels of eval

  1. Unit checks — deterministic validations for parsers, extractors, tool I/O (sketched after this list).
  2. Task evals — small golden sets (50–200 cases) with pass/fail rubrics.
  3. Ops metrics — latency, cost, review rate, and business KPIs.
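
Level 1 is ordinary test code. A minimal sketch of a deterministic unit check on extractor output — extract_invoice is a hypothetical stand-in for your own extractor, not a real API:

def extract_invoice(text: str) -> dict:
    # placeholder: swap in the real extractor under test
    return {"total": "4200.00", "currency": "USD"}

def test_invoice_extractor_returns_numeric_total():
    out = extract_invoice("Invoice #123 ... total due: $4,200.00")
    assert set(out) >= {"total", "currency"}   # required fields present
    assert float(out["total"]) > 0             # total parses as a number

Because these checks are deterministic, they can run on every commit alongside the rest of the test suite.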

Assembling the golden set
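
Keep it to 50–200 cases, each with an explicit pass/fail rubric, stored as YAML/JSON under tests/ (see the layout below) so the set versions with the code. One way to shape a case — the field names (id, input, rubric) are illustrative, not a prescribed schema:

EXAMPLE_CASE = {
    "id": "invoice-007",
    "input": "Invoice #123 ... total due: $4,200.00",
    "rubric": {
        "must_contain": ["4200.00"],   # pass/fail: output must include these
        "must_not_contain": ["N/A"],   # and must not include these
    },
}

The same structure is what a YAML or JSON file under tests/ would hold; the runner below treats each file as one case.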

A simple harness shape

tests/             # YAML/JSON cases
runner.{py,ts}     # loads case → calls pipeline → checks rules → logs deltas (sketched below)
report/            # pass %, common fails, cost Δ, example diffs
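
A minimal Python sketch of the runner line above, assuming JSON case files shaped like the example in the previous section and a call_pipeline placeholder for your real entry point:

import json
import time
from pathlib import Path

def call_pipeline(text: str) -> str:
    # placeholder: replace with the real pipeline call
    return text

def check(case: dict, output: str) -> bool:
    # apply the case's pass/fail rubric to the pipeline output
    rubric = case.get("rubric", {})
    must = rubric.get("must_contain", [])
    must_not = rubric.get("must_not_contain", [])
    return all(s in output for s in must) and not any(s in output for s in must_not)

def run(case_dir: str = "tests") -> dict:
    results = []
    for path in sorted(Path(case_dir).glob("*.json")):
        case = json.loads(path.read_text())
        start = time.monotonic()
        output = call_pipeline(case["input"])
        results.append({
            "id": case.get("id", path.stem),
            "pass": check(case, output),
            "latency_s": round(time.monotonic() - start, 3),
        })
    passed = sum(r["pass"] for r in results)
    return {"pass_rate": passed / max(len(results), 1), "results": results}

if __name__ == "__main__":
    print(json.dumps(run(), indent=2))

Redirect the output into report/ (e.g. python runner.py > report/latest.json) and diff it against the previous run to surface the pass % and regressions; cost and example diffs slot into the same report.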

Track on every PR
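
Run the harness in CI on every pull request and compare the fresh report against a committed baseline: pass %, common fails, cost Δ, and latency. One hedged way to gate the build — the file names and 2-point threshold are assumptions, not a convention:

import json
import sys
from pathlib import Path

def gate(report_path: str = "report/latest.json",
         baseline_path: str = "report/baseline.json",
         max_drop: float = 0.02) -> int:
    # non-zero exit fails the PR check when the pass rate regresses
    report = json.loads(Path(report_path).read_text())
    baseline = json.loads(Path(baseline_path).read_text())
    drop = baseline["pass_rate"] - report["pass_rate"]
    print(f"pass rate {report['pass_rate']:.1%} vs baseline {baseline['pass_rate']:.1%}")
    return 1 if drop > max_drop else 0

if __name__ == "__main__":
    sys.exit(gate())

Committing report/baseline.json only when you intentionally accept a new level means every later PR is measured against the same reference.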

A harness you use beats a perfect one you don’t. Keep it small, near the code, and relentlessly tied to the business outcome.