If you're shipping LLM features without an eval harness, you're shipping vibes. Here's the minimum viable harness we put in place on every AI engagement.
What an eval harness is
A test runner for non-deterministic code. You give it inputs, the properties you expect in each output, and a grader. The grader can be deterministic (a regex, a schema check, a cost ceiling) or evaluative (another model with a hand-written rubric).
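For concreteness, here is what the two grader styles might look like in Python. The function names, the required JSON keys, and the `judge` callable are illustrative assumptions, not part of any specific library.

```python
# A minimal sketch of the two grader styles. All names are illustrative.
import json
import re


def grade_deterministic(output: str) -> float:
    """Deterministic grader: a schema check plus a regex, no model in the loop."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    if not isinstance(parsed, dict):
        return 0.0
    has_required_keys = {"summary", "confidence"} <= parsed.keys()
    cites_a_source = bool(re.search(r"\[\d+\]", parsed.get("summary", "")))
    return 1.0 if has_required_keys and cites_a_source else 0.0


def grade_with_rubric(output: str, rubric: str, judge) -> float:
    """Evaluative grader: ask a judge model to score the output against a rubric.
    `judge` is assumed to be any callable that takes a prompt and returns a 0-1 float."""
    prompt = f"Rubric:\n{rubric}\n\nOutput to grade:\n{output}\n\nScore from 0 to 1:"
    return judge(prompt)
```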
The MVP
- A YAML or JSON file with 50-200 input/expected pairs covering happy paths, edge cases, and known-bad inputs.
- A grader that produces a score between 0 and 1 per case.
- A runner that aggregates scores per category and writes a markdown report.
- A CI job that fails if the rolling score drops more than X% from baseline. (A sketch of all four pieces follows below.)
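Wiring those four pieces together might look something like this sketch. The case file fields (`id`, `category`, `input`, `expected`), the `call_model` and `grade` callables, and the baseline numbers are all assumptions you would replace with your own.

```python
# Minimal eval runner sketch: load cases, grade, report, gate CI.
# Example case in cases.yaml (illustrative field names):
# - id: refund-policy-001
#   category: happy_path
#   input: "What is your refund window?"
#   expected: "30 days"
import sys
from collections import defaultdict
from statistics import mean

import yaml  # pip install pyyaml

BASELINE = 0.85        # last known-good aggregate score (assumption)
MAX_REGRESSION = 0.05  # fail CI if the aggregate drops more than 5 points


def run_evals(path: str, call_model, grade) -> dict[str, float]:
    """Run every case and return the mean score per category."""
    with open(path) as f:
        cases = yaml.safe_load(f)
    scores_by_category: dict[str, list[float]] = defaultdict(list)
    for case in cases:
        output = call_model(case["input"])
        scores_by_category[case["category"]].append(grade(output, case["expected"]))
    return {cat: mean(scores) for cat, scores in scores_by_category.items()}


def write_report(scores: dict[str, float], path: str = "eval_report.md") -> None:
    """Write a per-category markdown table."""
    lines = ["# Eval report", "", "| Category | Score |", "|---|---|"]
    lines += [f"| {cat} | {score:.2f} |" for cat, score in sorted(scores.items())]
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")


def check_regression(scores: dict[str, float]) -> None:
    """Exit non-zero when the aggregate score regresses past the threshold."""
    overall = mean(scores.values())
    if overall < BASELINE - MAX_REGRESSION:
        print(f"FAIL: {overall:.2f} is more than {MAX_REGRESSION:.0%} below baseline {BASELINE:.2f}")
        sys.exit(1)  # non-zero exit fails the CI job
    print(f"OK: overall score {overall:.2f}")
```

The CI job then just runs this script and treats a non-zero exit as a failed build.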
Why this matters in production
Without an eval, the only feedback signal on a prompt change is "does it still feel good?" - a signal that scales poorly past one engineer and one prompt. With an eval, you get an objective number that survives team changes, model upgrades, and the inevitable prompt drift over time.
Common mistakes
Three come up over and over:
- Treating the eval set as static. It should grow with every bug.
- Grading only with another LLM. Mix in deterministic checks.
- Forgetting cost. An eval should track tokens-per-call as a first-class metric, not an afterthought (see the sketch below).
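One way to make the cost mistake hard to repeat is to return token counts from the grader itself, alongside the score. This sketch assumes a `judge` callable and a `usage` dict from your LLM client; both names are illustrative assumptions.

```python
# Sketch: mix a deterministic gate with a model-graded rubric, and keep
# token usage a first-class metric on every graded case.
from dataclasses import dataclass


@dataclass
class GradedCase:
    score: float
    input_tokens: int
    output_tokens: int


def grade_case(output: str, expected: str, usage: dict, judge) -> GradedCase:
    # Deterministic gate first: a cheap check that can veto the case outright.
    deterministic = 1.0 if expected.lower() in output.lower() else 0.0
    # Model-graded rubric only when the deterministic gate passes.
    # `judge` is assumed to score `output` against `expected`, returning 0-1.
    rubric_score = judge(output, expected) if deterministic else 0.0
    return GradedCase(
        score=0.5 * deterministic + 0.5 * rubric_score,
        input_tokens=usage.get("input_tokens", 0),
        output_tokens=usage.get("output_tokens", 0),
    )
```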