Evals for LLM Applications

An eval is a repeatable test for an LLM app. You collect the inputs that matter, define a scorer that checks each output, and run both against every candidate prompt or model. Without evals, you are shipping on vibes.

The shape of an eval

  • Dataset. A frozen list of inputs such as user queries, documents, and tool histories. Typically 20 to 30 cases for fast smoke tests and 200 or more for release gates.
  • Scorer. A function from (input, output, expected) to a number in [0, 1]. Can be deterministic (regex match, JSON validity), LLM-judged (rubric), or human-labeled.
  • Run. Execute the system under test on every dataset row, aggregate the scores, and diff them against the previous run.
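
The three pieces above can be sketched in a few lines of Python. This is a minimal illustration, not a reference harness: the names Case, exact_match, run_eval, and diff_runs are hypothetical, and the two "systems" are plain lookup tables standing in for real model calls.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable


@dataclass(frozen=True)
class Case:
    prompt: str    # the input fed to the system under test
    expected: str  # the gold answer the scorer compares against


# A scorer maps (input, output, expected) to a number in [0, 1].
Scorer = Callable[[str, str, str], float]


def exact_match(prompt: str, output: str, expected: str) -> float:
    """Deterministic scorer: 1.0 on a case-insensitive match, else 0.0."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def run_eval(system: Callable[[str], str], dataset: list[Case], scorer: Scorer) -> list[float]:
    """Run the system on every dataset row and return one score per row."""
    return [scorer(case.prompt, system(case.prompt), case.expected) for case in dataset]


def diff_runs(previous: list[float], current: list[float]) -> float:
    """Change in mean score between two runs (positive means improvement)."""
    return mean(current) - mean(previous)


# Hypothetical frozen dataset (assumption: made up for illustration).
DATASET = [
    Case("2+2", "4"),
    Case("capital of France", "Paris"),
    Case("opposite of hot", "cold"),
]


def baseline(prompt: str) -> str:
    """Stand-in for the previous prompt or model: gets one answer wrong."""
    return {"2+2": "4", "capital of France": "Lyon", "opposite of hot": "cold"}[prompt]


def candidate(prompt: str) -> str:
    """Stand-in for the candidate prompt or model."""
    return {"2+2": "4", "capital of France": "paris", "opposite of hot": "cold"}[prompt]


previous_scores = run_eval(baseline, DATASET, exact_match)
current_scores = run_eval(candidate, DATASET, exact_match)
delta = diff_runs(previous_scores, current_scores)
```

Keeping the per-row scores rather than only the mean makes it possible to compare two runs case by case later.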

Types of scorers

  • Exact match / regex. Works for classification and structured extraction, where the correct output is unambiguous.
  • Reference-based. Compare the output against a gold answer using embedding similarity or ROUGE. Cheap, but brittle on open-ended tasks where many different answers are acceptable.
  • LLM-as-judge. Ask another LLM to score the output against a rubric. Scales well, but the judge prompt needs as much care as the system under test, or judge noise becomes the weakest link in the eval.
  • Human review. Slow and expensive, but still the ground truth for judgment-heavy tasks.
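
As a rough sketch of the automated scorer types, the functions below each return a value in [0, 1]. Several things here are assumptions: unigram_f1 is a simplified token-overlap score standing in for a real ROUGE implementation, RUBRIC is a made-up rubric, and judge is a hypothetical callable that sends a prompt to whatever judge model you use and returns its text reply.

```python
import json
import re
from collections import Counter
from typing import Callable


def regex_match(output: str, pattern: str) -> float:
    """Deterministic: 1.0 if the pattern occurs anywhere in the output."""
    return 1.0 if re.search(pattern, output) else 0.0


def json_valid(output: str) -> float:
    """Deterministic: 1.0 if the output parses as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0


def unigram_f1(output: str, expected: str) -> float:
    """Reference-based: F1 over shared lowercase tokens (a simplified stand-in for ROUGE)."""
    out_tokens = Counter(output.lower().split())
    ref_tokens = Counter(expected.lower().split())
    overlap = sum((out_tokens & ref_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())
    recall = overlap / sum(ref_tokens.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical rubric; a real one would spell out what each grade means.
RUBRIC = "Rate the answer from 1 (wrong) to 5 (fully correct). Reply with the digit only."


def judge_score(judge: Callable[[str], str], question: str, output: str) -> float:
    """LLM-as-judge: ask the judge for a 1-5 grade and rescale it to [0, 1]."""
    reply = judge(f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {output}")
    match = re.search(r"[1-5]", reply)
    if match is None:
        raise ValueError(f"judge reply contained no 1-5 grade: {reply!r}")
    return (int(match.group()) - 1) / 4
```

Passing the judge in as a callable keeps the scorer testable with a stub, for example judge_score(lambda prompt: "4", "q", "a"), and makes it easy to pin the judge to one model version.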

Running evals in CI

Gate PRs on eval regressions. A good setup:

  1. Sample 30 fast cases that run in under 60 seconds.
  2. Fail the build if the branch matches or beats main on fewer than 95% of those cases.
  3. Run the full 500-case suite nightly.
  4. Block release if any critical slice (top-10 customers, top-5 intents) regresses.
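
A gate like the one in steps 2 and 4 might look like the sketch below. It assumes that "win-rate vs. main" means the share of cases where the branch scores at least as high as main (ties count as held), and that a slice regresses when its mean score drops. The function names and example scores are hypothetical.

```python
from statistics import mean


def non_regression_rate(main_scores: list[float], branch_scores: list[float]) -> float:
    """Share of cases where the branch scores at least as high as main."""
    if not main_scores or len(main_scores) != len(branch_scores):
        raise ValueError("both runs must cover the same non-empty set of cases")
    held = sum(b >= m for m, b in zip(main_scores, branch_scores))
    return held / len(main_scores)


def passes_gate(main_scores: list[float], branch_scores: list[float], threshold: float = 0.95) -> bool:
    """PR gate: True when the non-regression rate meets the threshold."""
    return non_regression_rate(main_scores, branch_scores) >= threshold


def regressed_slices(
    main_by_slice: dict[str, list[float]],
    branch_by_slice: dict[str, list[float]],
    tolerance: float = 0.0,
) -> list[str]:
    """Release gate: names of slices whose mean score dropped by more than the tolerance."""
    return sorted(
        name
        for name, main_scores in main_by_slice.items()
        if mean(branch_by_slice[name]) < mean(main_scores) - tolerance
    )


# Hypothetical 30-case smoke run: main passes everything, the branches fail 2 and 1 cases.
main_run = [1.0] * 30
two_failures = [1.0] * 28 + [0.0] * 2   # 28/30 held, below the 95% threshold
one_failure = [1.0] * 29 + [0.0]        # 29/30 held, above the 95% threshold

# Hypothetical critical slices from the nightly suite.
main_slices = {"top-10 customers": [0.9, 0.9], "top-5 intents": [0.8, 0.8]}
branch_slices = {"top-10 customers": [0.9, 0.9], "top-5 intents": [0.8, 0.6]}
blocked = regressed_slices(main_slices, branch_slices)
```

In CI, a script would exit with a non-zero status when passes_gate returns False or when blocked is non-empty. With only 30 cases, a 95% threshold tolerates a single regressed case, so a small tolerance on slice means can help absorb judge noise.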

Failure modes

  • Judge drift. Your LLM judge changes scoring behavior after a model upgrade. Freeze the judge model version.
  • Dataset contamination. Your team writes eval cases from memory of bugs it has already fixed, so the model looks great on the eval while production still breaks. Sample real traffic instead.
  • Single-score tyranny. An overall mean of 0.82 hides that finance queries dropped from 0.9 to 0.5. Always report per-slice.
  • Gaming. Engineers optimize for the eval instead of the product. Refresh the dataset monthly, freezing each new version so that runs within a cycle stay comparable.
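
The single-score problem is easy to reproduce with made-up numbers. In the sketch below, a hypothetical 100-row dataset has 20 finance rows and 80 other rows. Between two runs, finance drops from 0.9 to 0.5 while the rest improves from 0.8 to 0.9, so the overall mean stays at 0.82 in both runs. The helper names and the 0.05 drop threshold are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

Row = tuple[str, float]  # (slice name, score)


def per_slice_means(rows: list[Row]) -> dict[str, float]:
    """Group scores by slice and average each group."""
    by_slice: dict[str, list[float]] = defaultdict(list)
    for name, score in rows:
        by_slice[name].append(score)
    return {name: mean(scores) for name, scores in by_slice.items()}


def slice_drops(previous: list[Row], current: list[Row], max_drop: float = 0.05) -> dict[str, tuple[float, float]]:
    """Slices whose mean fell by more than max_drop, as {slice: (before, after)}."""
    before = per_slice_means(previous)
    after = per_slice_means(current)
    return {
        name: (before[name], after[name])
        for name in before
        if name in after and before[name] - after[name] > max_drop
    }


# Hypothetical runs: same overall mean, very different per-slice picture.
previous_run = [("finance", 0.9)] * 20 + [("other", 0.8)] * 80
current_run = [("finance", 0.5)] * 20 + [("other", 0.9)] * 80

overall_before = mean(score for _, score in previous_run)
overall_after = mean(score for _, score in current_run)
drops = slice_drops(previous_run, current_run)
```

Here the two overall means are indistinguishable, while drops flags the finance slice with its before and after values.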

When NOT to invest heavily

If you have fewer than 100 daily users and the product has not yet reached product-market fit (pre-PMF), building a full eval harness is premature. Ship, watch real sessions, and write your first evals from the first complaints.