Evals for LLM Applications

An eval is a repeatable test for an LLM app. You collect the inputs that matter, define a scorer that checks each output, and run both against every candidate prompt or model. Without evals you are shipping on vibes.

The shape of an eval

Dataset. A frozen list of inputs: user queries, documents, tool histories. Typically 20 to 30 cases for fast smoke tests, 200+ for release gates.
Scorer. A function from (input, output, expected) to a number in [0, 1]. Can be deterministic (regex match, JSON validity), LLM-judged (rubric), or human-labeled.
Run. Execute the system under test across every dataset row, aggregate scores, diff vs. the previous run.
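
The three pieces above fit in a few dozen lines. The following is a minimal sketch, not a reference harness: `Case`, `run_eval`, `diff_runs`, and the two stand-in systems are hypothetical names invented for illustration, and a real system under test would call a model instead of looking up a dictionary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Sequence


@dataclass(frozen=True)
class Case:
    case_id: str
    query: str
    expected: str


# A scorer maps (input, output, expected) to a number in [0, 1].
Scorer = Callable[[str, str, str], float]


def exact_match(query: str, output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on an exact match after trimming whitespace."""
    return 1.0 if output.strip() == expected.strip() else 0.0


def run_eval(
    system: Callable[[str], str], dataset: Sequence[Case], scorer: Scorer
) -> Dict[str, float]:
    """Execute the system on every row and return a score per case id."""
    return {c.case_id: scorer(c.query, system(c.query), c.expected) for c in dataset}


def diff_runs(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Score change for each case whose score differs between two runs."""
    return {
        cid: current[cid] - previous[cid]
        for cid in current
        if cid in previous and current[cid] != previous[cid]
    }


# Toy frozen dataset and two stand-in systems (assumptions for the example).
DATASET = (
    Case("c1", "2+2", "4"),
    Case("c2", "capital of France", "Paris"),
)


def baseline(query: str) -> str:
    return {"2+2": "4", "capital of France": "Paris"}.get(query, "")


def candidate(query: str) -> str:
    return {"2+2": "4", "capital of France": "Lyon"}.get(query, "")


previous_run = run_eval(baseline, DATASET, exact_match)
current_run = run_eval(candidate, DATASET, exact_match)
print("mean:", mean(previous_run.values()), "->", mean(current_run.values()))
print("changed:", diff_runs(previous_run, current_run))
```

Keeping per-case scores, rather than only the aggregate, is what makes the diff against the previous run possible.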

Types of scorers

Exact match / regex — works for classification, structured extraction.
Reference-based — compare against a gold answer with embedding similarity or ROUGE. Cheap; brittle on open-ended tasks.
LLM-as-judge — ask another LLM to score against a rubric. Scales well; prompt the judge carefully, or the judge's own errors will limit how much you can trust the scores.
Human review — slow and expensive, still the ground truth for judgment-heavy tasks.
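
As a rough illustration of the first and third scorer types, the sketch below shows two deterministic scorers and the skeleton of an LLM judge. The rubric text, the 1-to-5 scale, and the `call_judge` parameter are assumptions made for the example; `call_judge` stands in for whatever client you use to reach the judge model, so no particular vendor API is implied.

```python
import json
import re
from typing import Callable


def regex_score(output: str, pattern: str) -> float:
    """1.0 if the pattern occurs anywhere in the output, else 0.0."""
    return 1.0 if re.search(pattern, output) else 0.0


def json_validity_score(output: str) -> float:
    """1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


# Hypothetical rubric; a real one would be specific to the task.
RUBRIC = (
    "Score the answer from 1 (wrong) to 5 (fully correct and complete). "
    "Reply with the digit only."
)


def judge_score(question: str, answer: str, call_judge: Callable[[str], str]) -> float:
    """Ask the judge for a 1-5 grade and map it onto [0, 1]."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"
    reply = call_judge(prompt)
    match = re.search(r"\b[1-5]\b", reply)
    if match is None:
        # Assumption: an unparseable judge reply counts as a failure.
        return 0.0
    return (int(match.group()) - 1) / 4


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call, so the sketch runs offline."""
    return "4"


print(regex_score("label: refund", r"\brefund\b"))
print(json_validity_score('{"intent": "refund"}'))
print(judge_score("What is 2+2?", "4", fake_judge))
```

Passing the judge in as a callable keeps the scorer testable without network access and makes it easy to pin one judge model version in a single place.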

Running evals in CI

Gate PRs on eval regressions. A good setup:

Sample 30 fast cases that run in under 60 seconds.
Fail the build if the win-rate vs. main (the share of cases where the PR scores at least as well as main) drops below 95%.
Run the full 500-case suite nightly.
Block release if any critical slice (top-10 customers, top-5 intents) regresses.
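
The two gates above could be sketched as follows. This assumes per-case scores are stored as dictionaries keyed by case id, that a tie counts as a win, and that a slice has regressed when its mean score falls below the mean on main; the function names and the sample data are hypothetical.

```python
from statistics import mean
from typing import Dict, List


def win_rate(main: Dict[str, float], pr: Dict[str, float]) -> float:
    """Share of shared cases where the PR scores at least as well as main."""
    shared = main.keys() & pr.keys()
    if not shared:
        raise ValueError("no shared cases to compare")
    return sum(pr[c] >= main[c] for c in shared) / len(shared)


def pr_gate(main: Dict[str, float], pr: Dict[str, float], threshold: float = 0.95) -> bool:
    """True if the PR may merge: win-rate vs. main is at or above the threshold."""
    return win_rate(main, pr) >= threshold


def regressed_slices(
    main: Dict[str, float], pr: Dict[str, float], slices: Dict[str, List[str]]
) -> List[str]:
    """Names of slices whose mean score is lower on the PR than on main."""
    return sorted(
        name
        for name, ids in slices.items()
        if mean(pr[i] for i in ids) < mean(main[i] for i in ids)
    )


def release_gate(
    main: Dict[str, float], pr: Dict[str, float], critical: Dict[str, List[str]]
) -> bool:
    """True if no critical slice regressed."""
    return not regressed_slices(main, pr, critical)


# Toy scores (assumed data): one case gets worse, one gets better.
MAIN = {"a": 1.0, "b": 1.0, "c": 0.5, "d": 1.0}
PR = {"a": 1.0, "b": 0.5, "c": 1.0, "d": 1.0}
CRITICAL = {"finance": ["a", "b"], "support": ["c", "d"]}

print("win-rate:", win_rate(MAIN, PR))
print("PR gate passes:", pr_gate(MAIN, PR))
print("regressed slices:", regressed_slices(MAIN, PR, CRITICAL))
```

In a CI job, the boolean would typically be turned into the process exit code, for example `sys.exit(0 if pr_gate(main, pr) else 1)`, so that a failed gate fails the build.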

Failure modes

Judge drift. Your LLM judge changes scoring behavior after a model upgrade. Freeze the judge model version.
Dataset contamination. Your team writes eval cases from memory of bugs it has already fixed. The model looks great on those cases while production still breaks on inputs nobody anticipated. Sample real traffic.
Single-score tyranny. An overall mean of 0.82 hides that finance queries dropped from 0.9 to 0.5. Always report scores per slice.
Gaming. Engineers optimize the eval instead of the product. Refresh the dataset monthly, and freeze and version each refresh so runs within a period stay comparable.
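
To make the single-score problem concrete, here is a small sketch of per-slice reporting. The rows are invented so that the numbers match the example above (two finance cases at 0.5, eight other cases at 0.9), and the slice names and helper functions are hypothetical.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, float]  # (slice name, score)


def overall_mean(rows: Iterable[Row]) -> float:
    """Single aggregate score across all rows, rounded for display."""
    return round(mean(score for _, score in rows), 2)


def per_slice_means(rows: Iterable[Row]) -> Dict[str, float]:
    """Mean score for each slice, rounded for display."""
    buckets: Dict[str, List[float]] = defaultdict(list)
    for slice_name, score in rows:
        buckets[slice_name].append(score)
    return {name: round(mean(scores), 2) for name, scores in buckets.items()}


def slice_drops(previous: Dict[str, float], current: Dict[str, float]) -> Dict[str, float]:
    """Slices whose mean fell since the previous run, with the size of the drop."""
    return {
        name: round(previous[name] - current[name], 2)
        for name in current
        if name in previous and current[name] < previous[name]
    }


# Assumed data: finance collapsed while everything else held steady.
PREVIOUS = {"finance": 0.9, "other": 0.9}
ROWS: List[Row] = [("finance", 0.5)] * 2 + [("other", 0.9)] * 8

current = per_slice_means(ROWS)
print("overall:", overall_mean(ROWS))
print("per slice:", current)
print("drops:", slice_drops(PREVIOUS, current))
```

The overall mean stays at 0.82 because the finance slice is only two of ten rows, which is exactly why the per-slice view is needed.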

When NOT to invest heavily

If you have fewer than 100 daily users and the product is pre-PMF, building a full eval harness is premature. Ship, watch real sessions, and write evals from the first complaints.