Apr 02, 2026 · Engineering

Why we stopped writing acceptance criteria before prompts.

Acceptance criteria assume a deterministic spec. Prompts assume an evaluative one. Mixing them produces neither.

For years we wrote AI-feature tickets the same way we wrote CRUD tickets - with strict acceptance criteria. The result was a graveyard of test cases that no model could ever satisfy in the same shape twice.

[Image: eval-harness dashboard]

Why acceptance criteria don't fit AI features

Acceptance criteria assume a deterministic specification. The same input produces the same output, every time. Prompts and language models do not work that way - they produce a distribution of outputs. Mixing the two paradigms produces neither: tickets that can't be marked done, and features that can't be measured.

So we changed the order. The eval is the spec. The prompt is the implementation. The acceptance criterion becomes a number on a dashboard, not a yes/no on a ticket.

What an eval-first workflow looks like

Three things, in this order. First, every AI feature opens with a one-page eval doc - inputs, expected qualitative properties, scoring rubric. Second, the prompt is the last thing we write, not the first. Third, we set a quality floor in CI; if the eval drops below it on a PR, the PR fails.
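That quality floor doesn't need heavy tooling. Here's a minimal sketch of the CI gate in Python, assuming the eval run has already written an aggregate score to a results file - the filename, the mean_score key, and the 0.85 threshold are all illustrative, not our actual pipeline.

```python
# Hypothetical CI gate: fail the PR check if the eval score drops below the floor.
import json
import sys

QUALITY_FLOOR = 0.85  # illustrative threshold, agreed in the eval doc

def main() -> None:
    # Assumes the eval run wrote something like {"mean_score": 0.91}
    # to eval_results.json earlier in the CI job.
    with open("eval_results.json") as f:
        score = json.load(f)["mean_score"]
    print(f"eval score: {score:.3f} (floor: {QUALITY_FLOOR})")
    if score < QUALITY_FLOOR:
        sys.exit(1)  # non-zero exit fails the PR's CI check

if __name__ == "__main__":
    main()
```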

An eval is essentially a test runner for non-deterministic code. You give it inputs, expected qualitative properties, and a grader. The grader is usually another model scoring the output against a hand-written rubric, or a deterministic check (regex, schema validation, cost ceiling).
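To make that concrete, here's a minimal sketch of such a runner. Everything in it is illustrative rather than our production harness: generate() stands in for the model call under test, the regex check is the deterministic grader, and rubric_grade() is a stub where a model grader scored against a rubric would go.

```python
# Sketch of an eval runner: inputs, a deterministic grader, and a rubric grader stub.
import re
from statistics import mean

CASES = [
    {"input": "Summarize: the meeting moved to Tuesday.", "must_match": r"Tuesday"},
    {"input": "Summarize: budget approved at $40k.",      "must_match": r"\$?40"},
]

def generate(prompt_input: str) -> str:
    """Placeholder for the model call under test (echoes input so the sketch runs)."""
    return prompt_input

def deterministic_grade(output: str, pattern: str) -> float:
    """Deterministic check: did the output keep the key fact?"""
    return 1.0 if re.search(pattern, output) else 0.0

def rubric_grade(output: str) -> float:
    """Stand-in for a model grader scoring against a hand-written rubric item, e.g. conciseness."""
    return 1.0 if len(output) < 200 else 0.5

def run_eval() -> float:
    scores = []
    for case in CASES:
        output = generate(case["input"])
        scores.append(mean([
            deterministic_grade(output, case["must_match"]),
            rubric_grade(output),
        ]))
    return mean(scores)

if __name__ == "__main__":
    print(f"mean eval score: {run_eval():.2f}")
```

The aggregate that run_eval() returns is exactly the number the CI gate above compares against the floor.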

What this changed for us

Three things. AI features stopped landing as "done" on subjective sign-off. Prompt regressions stopped slipping into production - we caught them in CI. And junior engineers stopped guessing whether a prompt was "good enough." The number told them.

The takeaway

You can't ship reliable AI by inspection. You ship it by measurement. Anything else is wishful thinking dressed as engineering. Move your spec into evals, your implementation into prompts, and your judgment into the rubric. Then ship.