Evals for LLM Applications
An eval is a test case for an LLM app. You collect inputs that matter, define a scorer that checks the output, and run the pair against each candidate prompt or model. Without evals you are shipping based on vibes.
The shape of an eval
- Dataset. A frozen list of inputs — user queries, documents, tool histories. Typically 20 for fast smoke tests, 200+ for release gates.
- Scorer. A function from (input, output, expected) to a number in [0, 1]. Can be deterministic (regex match, JSON validity), LLM-judged (rubric), or human-labeled.
- Run. Execute the system under test across every dataset row, aggregate scores, diff vs. the previous run.
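The three pieces above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: `Case`, `run_eval`, `toy_system`, the four dataset rows, and the `previous_mean` value are all hypothetical stand-ins, and a real system under test would call a prompt and model rather than a keyword check.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Case:
    """One frozen dataset row: an input and the answer we expect."""
    inp: str
    expected: str


# A scorer maps (input, output, expected) to a number in [0, 1].
Scorer = Callable[[str, str, str], float]


def exact_match(inp: str, out: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if out.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str], dataset: list[Case], scorer: Scorer) -> dict:
    """Run the system on every row, score each output, and aggregate."""
    scores = [scorer(c.inp, system(c.inp), c.expected) for c in dataset]
    return {"scores": scores, "mean": mean(scores)}


def toy_system(text: str) -> str:
    """Hypothetical system under test: a naive keyword sentiment classifier."""
    return "positive" if "love" in text else "negative"


DATASET = [
    Case("I love this", "positive"),
    Case("This is awful", "negative"),
    Case("Not bad at all", "positive"),
    Case("I would love a refund", "negative"),
]

current = run_eval(toy_system, DATASET, exact_match)

# Diff against the previous run (hypothetical stored value).
previous_mean = 0.75
delta = current["mean"] - previous_mean
print(f"mean={current['mean']:.2f} delta={delta:+.2f}")
```

Keeping the scorer signature fixed is the useful design choice here: deterministic, LLM-judged, and human-labeled scorers can then be swapped without touching the run loop.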
Types of scorers
- Exact match / regex — works for classification, structured extraction.
- Reference-based — compare against a gold answer with embedding similarity or ROUGE. Cheap; brittle on open-ended tasks.
- LLM-as-judge — ask another LLM to score against a rubric. Scales well; prompt the judge carefully or it becomes the bottleneck.
- Human review — slow and expensive, still the ground truth for judgment-heavy tasks.
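As a rough sketch of the automated scorer types above, the following shows a regex scorer, a JSON-validity scorer, and an LLM-as-judge wrapper. All names are illustrative assumptions. In particular, `ask_judge` is a hypothetical callable standing in for whatever client sends a prompt to the judge model, the rubric text is made up, and the 1-to-5 scale mapped onto [0, 1] is one possible convention rather than a standard.

```python
import json
import re
from typing import Callable

Scorer = Callable[[str, str, str], float]


def regex_scorer(pattern: str) -> Scorer:
    """Build a scorer that extracts group 1 from the output and compares it to expected."""
    compiled = re.compile(pattern)

    def score(inp: str, out: str, expected: str) -> float:
        match = compiled.search(out)
        return 1.0 if match and match.group(1) == expected else 0.0

    return score


def json_validity(inp: str, out: str, expected: str) -> float:
    """Deterministic scorer: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(out)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


# Hypothetical rubric; a real one should be written and tested per task.
RUBRIC = (
    "Rate the answer from 1 (unusable) to 5 (fully correct and complete). "
    "Reply with the number only."
)


def judge_scorer(ask_judge: Callable[[str], str]) -> Scorer:
    """Wrap a judge call (assumed: prompt in, text reply out) as a scorer."""

    def score(inp: str, out: str, expected: str) -> float:
        prompt = f"{RUBRIC}\n\nQuestion: {inp}\nReference: {expected}\nAnswer: {out}"
        match = re.search(r"\b[1-5]\b", ask_judge(prompt))
        if match is None:
            return 0.0  # an unparseable judge reply is counted as a failure
        return (int(match.group()) - 1) / 4  # map 1..5 onto 0..1

    return score


# Usage with a stubbed judge that always answers "4".
intent = regex_scorer(r"intent:\s*(\w+)")
stub_judge = judge_scorer(lambda prompt: "4")
print(intent("", "intent: refund", "refund"))
print(json_validity("", '{"a": 1}', ""))
print(stub_judge("What is 2+2?", "4", "4"))
```

Passing the judge in as a callable keeps the scorer testable offline and makes it easy to pin a specific judge model in one place.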
Running evals in CI
Gate PRs on eval regressions. A good setup:
- Sample 30 fast cases that run in under 60 seconds.
- Fail the build if the candidate's win-or-tie rate vs. main drops below 95%, that is, if it scores worse than main on more than 5% of cases.
- Run the full 500-case suite nightly.
- Block release if any critical slice (top-10 customers, top-5 intents) regresses.
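The gating rules above could be wired up roughly as follows. This sketch makes two assumptions worth stating: "win-rate vs. main" is read as the share of cases where the candidate scores at least as well as main, and per-slice means are computed elsewhere and passed in as dictionaries. The function names, slice names, and scores are hypothetical.

```python
def win_or_tie_rate(candidate: list[float], baseline: list[float]) -> float:
    """Share of cases where the candidate scores at least as well as the baseline."""
    if len(candidate) != len(baseline) or not candidate:
        raise ValueError("score lists must be non-empty and the same length")
    return sum(c >= b for c, b in zip(candidate, baseline)) / len(candidate)


def pr_gate(candidate: list[float], baseline: list[float], min_rate: float = 0.95) -> bool:
    """PR check: pass only if the win-or-tie rate stays at or above min_rate."""
    return win_or_tie_rate(candidate, baseline) >= min_rate


def regressed_slices(
    candidate_means: dict[str, float],
    baseline_means: dict[str, float],
    critical: list[str],
    tolerance: float = 0.0,
) -> list[str]:
    """Critical slices whose mean score fell by more than `tolerance`."""
    return sorted(
        name
        for name in critical
        if candidate_means[name] < baseline_means[name] - tolerance
    )


# Toy per-case scores on the fast sample (same case order in both runs).
baseline_scores = [1.0, 1.0, 0.5, 1.0]
candidate_scores = [1.0, 0.0, 1.0, 1.0]
pr_ok = pr_gate(candidate_scores, baseline_scores)

# Toy per-slice means from the full suite (hypothetical slice names).
baseline_slices = {"top10_customers": 0.90, "top5_intents": 0.85}
candidate_slices = {"top10_customers": 0.92, "top5_intents": 0.80}
blocked_by = regressed_slices(
    candidate_slices, baseline_slices, ["top10_customers", "top5_intents"]
)

# A CI script would typically exit non-zero when either gate fails.
exit_code = 0 if pr_ok and not blocked_by else 1
print(pr_ok, blocked_by, exit_code)
```

The `tolerance` parameter is an addition for noisy scorers such as LLM judges; with the default of 0.0 any drop in a critical slice blocks the release, which matches the rule as stated.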
Failure modes
- Judge drift. Your LLM judge changes scoring behavior after a model upgrade. Freeze the judge model version.
- Dataset contamination. Your team writes eval cases from memory of bugs it has already fixed, so the model looks great while production still breaks. Sample real traffic instead.
- Single-score tyranny. An overall mean of 0.82 hides that finance queries dropped from 0.9 to 0.5. Always report per-slice.
- Gaming. Engineers optimize the eval instead of the product. Refresh the dataset monthly.
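To make the single-score problem concrete, here is a small sketch of per-slice reporting. The rows are toy data constructed so that the overall mean stays at 0.82 across two runs while the finance slice falls from 0.9 to 0.5; the slice names, row counts, and the `per_slice_report` helper are assumptions for illustration only.

```python
from collections import defaultdict
from statistics import mean


def per_slice_report(rows: list[tuple[str, float]]) -> dict:
    """rows are (slice_name, score) pairs; returns the overall mean and a mean per slice."""
    by_slice = defaultdict(list)
    for name, score in rows:
        by_slice[name].append(score)
    return {
        "overall": round(mean([score for _, score in rows]), 2),
        "slices": {name: round(mean(scores), 2) for name, scores in by_slice.items()},
    }


# Toy data: 2 finance cases and 8 other cases in each run.
before_rows = [("finance", 0.9)] * 2 + [("other", 0.8)] * 8
after_rows = [("finance", 0.5)] * 2 + [("other", 0.9)] * 8

before = per_slice_report(before_rows)
after = per_slice_report(after_rows)

# Slices whose mean dropped between runs, as (before, after) pairs.
drops = {
    name: (before["slices"][name], after["slices"][name])
    for name in before["slices"]
    if after["slices"][name] < before["slices"][name]
}

print(before["overall"], after["overall"])
print(drops)
```

In this toy case the headline number does not move at all, so only the per-slice view reveals the regression.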
When NOT to invest heavily
If you have fewer than 100 daily users and the product has not yet reached product-market fit, building a full eval harness is premature. Ship, watch real sessions, and write your first evals from the first complaints.