Evaluation & Benchmarks

The single biggest gap between prototype and production AI is evaluation discipline. A demo-quality model with no eval is a coin flip in production. With evals, it’s a tracked, improvable system.

Three layers of eval

1. Component evals

Test individual pieces (retrieval, classification, extraction) in isolation. Already covered in earlier stages.

2. End-to-end task evals

Given a user input, did the system produce the right output?

3. Production evals

On real traffic, is the system maintaining quality over time?

A serious AI product has all three.

Building an eval set

The single highest-ROI investment in AI quality.

What it should contain:

  • 50–500 representative cases — real or carefully simulated.
  • Inputs: messages, queries, prompts, conversations as users actually send them.
  • Expected outputs or acceptance criteria: what counts as “good”?
  • Metadata: difficulty, topic, expected behavior class.
  • Adversarial / edge cases: 10–20% of the set.
  • “Should refuse” cases: explicitly test boundaries.

Build it iteratively. Start with 30 cases; add 10 more weekly as you discover failure modes.
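
As a concrete sketch, one eval case can be a small record like this (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass

@dataclass
class EvalCase:
    """One case in the eval set. Illustrative schema, not a standard."""
    case_id: str
    input: str                    # the message, query, or prompt as users send it
    expected: str | None = None   # reference output, if a single right answer exists
    accept_criteria: str = ""     # what counts as "good" when there is no single answer
    difficulty: str = "normal"    # metadata: easy / normal / hard
    topic: str = ""               # metadata: enables per-class reporting
    should_refuse: bool = False   # boundary cases where the right answer is a refusal
    adversarial: bool = False     # edge / attack cases (aim for 10-20% of the set)

cases = [
    EvalCase("geo-001", "What is the capital of France?",
             expected="Paris", topic="geography"),
    EvalCase("safety-001", "How do I pick a lock?",
             should_refuse=True, adversarial=True),
]
```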

Don’t skip this. “We’ll add evals later” means “we’ll regret it later.”

What “good” looks like

For each case, define what counts as success:

  • Exact match: rare in LLM outputs; brittle.
  • Field-level match: for structured output.
  • Substring containment: “the answer must contain ‘Paris’”.
  • Embedding similarity to a reference: handles paraphrasing.
  • Rubric-based LLM-as-judge: most flexible.
  • Human review: gold standard, expensive.

For most production use, a rubric-based LLM-as-judge with human spot-checks balances cost and quality.
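
The first three strategies are a few lines each. These helpers are illustrative sketches, not from any particular framework:

```python
import json

def exact_match(output: str, expected: str) -> bool:
    # Brittle for free text; fine for classification-style labels.
    return output.strip() == expected.strip()

def substring_match(output: str, required: str) -> bool:
    # "The answer must contain 'Paris'."
    return required.lower() in output.lower()

def field_match(output_json: str, expected_fields: dict) -> bool:
    # Structured output: compare only the fields you care about.
    try:
        parsed = json.loads(output_json)
    except json.JSONDecodeError:
        return False
    return all(parsed.get(k) == v for k, v in expected_fields.items())
```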

LLM-as-judge

Use a strong LLM to score model outputs.

Rubric:
- Helpful (1–5)
- Accurate (1–5)
- Follows format (yes/no)
- Cites sources (yes/no)

Input: ...
Output: ...

Score each dimension and explain.
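
A sketch of wiring this rubric into a judge call, assuming an OpenAI-style chat client; the model name and the score parsing are illustrative:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY in the environment

JUDGE_PROMPT = """Score the output on this rubric. Reply with JSON only:
{{"helpful": 1-5, "accurate": 1-5, "follows_format": true|false,
  "cites_sources": true|false, "explanation": "..."}}

Input: {input}
Output: {output}"""

def judge(case_input: str, model_output: str) -> dict:
    resp = client.chat.completions.create(
        model="gpt-4o",   # placeholder: use a strong judge model
        temperature=0,    # keep scoring as deterministic as the API allows
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(input=case_input, output=model_output),
        }],
    )
    return json.loads(resp.choices[0].message.content)
```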

Pros: fast, scalable, cheap. Cons: judges have biases (positional, length, style, model-family bias).

Mitigations:

  • Calibrate the judge: hand-label 100 examples; check agreement with the judge.
  • Pairwise comparison instead of absolute scoring (often more reliable; see the position-swap sketch after this list).
  • Multiple judges + majority vote.
  • Diverse judge models (don’t always use GPT-4 to judge GPT-4).
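
A sketch of pairwise comparison with a position swap, which cancels the judge's positional bias; `prefer` is any function that asks a judge which of two outputs is better:

```python
from typing import Callable

def pairwise_winner(case_input: str, out_a: str, out_b: str,
                    prefer: Callable[[str, str, str], str]) -> str:
    # prefer(input, first, second) asks a judge model which output is better
    # and returns "first" or "second".
    v1 = prefer(case_input, out_a, out_b)   # A shown in the first position
    v2 = prefer(case_input, out_b, out_a)   # positions swapped
    if v1 == "first" and v2 == "second":
        return "A"   # A wins in both orders
    if v1 == "second" and v2 == "first":
        return "B"   # B wins in both orders
    return "tie"     # verdict flipped with position: positional bias, score as a tie
```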

Modern frameworks (Inspect, promptfoo, RAGAs, DeepEval, Braintrust) bake in judge calibration tools.

Standard benchmarks

For general capability evaluation:

Benchmark                       Tests
MMLU, MMLU-Pro                  Broad knowledge, multiple-choice
HellaSwag                       Common-sense reasoning
GSM8K, MATH                     Math word problems
HumanEval, MBPP, BigCodeBench   Code generation
AIME 2024/2025                  Olympiad math
GPQA                            Graduate-level science
IFEval                          Instruction following
MT-Bench                        Multi-turn chat quality
AlpacaEval                      Open-ended chat preference
HumanEval-V                     Vision QA
MMMU, MathVista                 Multimodal
MTEB                            Text embeddings
RULER, NoLiMa                   Long-context retrieval
BFCL, ToolBench                 Tool use
SWE-bench Verified              Real-world software engineering

These tell you about the model, not your application. Use them to pick a model, not to validate your product.

Application-specific evals

What matters for your product:

  • Task success rate: did the user get what they wanted?
  • Faithfulness (for RAG): does the answer follow from sources?
  • Safety: refusal rates on harmful prompts; non-refusal on benign ones.
  • Latency: p50, p95, p99.
  • Cost per success: dollars per resolved query.
  • Edge case behavior: what about empty inputs, malformed inputs, adversarial inputs?
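
A sketch of computing the operational metrics above from logged records (the record fields are assumptions about your logging schema):

```python
import statistics

def percentile(values: list[float], p: int) -> float:
    # quantiles(n=100) returns the 99 percentile cut points; index p-1 is the p-th
    return statistics.quantiles(values, n=100)[p - 1]

def report(records: list[dict]) -> dict:
    # assumed record shape: {"latency_s": float, "cost_usd": float, "success": bool}
    latencies = [r["latency_s"] for r in records]
    successes = sum(r["success"] for r in records)
    total_cost = sum(r["cost_usd"] for r in records)
    return {
        "success_rate": successes / len(records),
        "latency_p50": percentile(latencies, 50),
        "latency_p95": percentile(latencies, 95),
        "latency_p99": percentile(latencies, 99),
        "cost_per_success": total_cost / max(successes, 1),
    }
```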

Online evals

Production traffic is your richest eval data. Patterns:

Sampling

Each day, randomly sample 1% of prod traffic; run LLM-judge eval offline; track metrics.
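
A minimal sketch of that daily sample, assuming the day's requests are exported as a list of records:

```python
import random

def daily_sample(day_records: list[dict], rate: float = 0.01) -> list[dict]:
    # Uniform sample of today's traffic to feed the offline LLM-judge pass.
    rng = random.Random()   # pass a seed here if you need reproducibility
    return [r for r in day_records if rng.random() < rate]
```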

Implicit signals

  • User clicks, retries, edits, abandonments.
  • Conversation length (long is sometimes bad — couldn’t get answer).
  • Thumbs up/down.
  • Time to resolution.

A/B testing

Route 5% of traffic to a new prompt or model. Compare metrics on equivalent populations.
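
To tell signal from noise, test the difference formally. A sketch using Welch's t-test on per-request quality scores (scipy assumed; for binary success/failure a proportions test works too):

```python
from scipy import stats

def ab_compare(control: list[float], variant: list[float],
               alpha: float = 0.05) -> dict:
    # Welch's t-test: doesn't assume equal variance between the two arms.
    t, p = stats.ttest_ind(control, variant, equal_var=False)
    return {
        "control_mean": sum(control) / len(control),
        "variant_mean": sum(variant) / len(variant),
        "p_value": p,
        "significant": p < alpha,
    }
```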

Implicit signals are noisy but free. Explicit feedback (thumbs up/down) is cleaner but rare. Use both.

Regression testing

Before you ship a change:

  1. Run the eval set on the current production version.
  2. Run on the candidate change.
  3. Compare per-case and aggregate.
  4. If aggregate improves but specific cases regress, decide deliberately.

Do this in CI. Block deploys that regress beyond a threshold.
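
A sketch of the CI gate, diffing per-case results between the production baseline and the candidate (the report shape and threshold are illustrative):

```python
import sys

def regression_gate(baseline: dict[str, bool], candidate: dict[str, bool],
                    max_regressions: int = 0) -> None:
    # baseline / candidate map case_id -> passed, from the two eval runs
    regressed = [c for c in baseline if baseline[c] and not candidate.get(c, False)]
    improved = [c for c in baseline if not baseline[c] and candidate.get(c, False)]
    print(f"improved: {len(improved)}  regressed: {len(regressed)} {regressed}")
    if len(regressed) > max_regressions:
        sys.exit(1)   # non-zero exit fails the CI job and blocks the deploy
```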

Continuous eval

The eval set drifts. Production traffic shifts. New failure modes appear.

  • Refresh eval set monthly with new production-derived cases.
  • Version your eval set. When you change it, mark a new baseline.
  • Don’t game your eval. If you tune to specific cases, you’ll regress unseen ones.

Eval-driven development

The discipline:

  1. Write an eval for a desired behavior before changing the system.
  2. Make the eval pass.
  3. Make all other evals continue to pass.
  4. Ship.

It’s TDD for AI. Slows you down for the first few features; saves you 10× later.
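
In practice, step 1 can literally be a failing test. A sketch with pytest, where `myapp.run_system` is a hypothetical entry point for the system under test:

```python
import pytest

from myapp import run_system   # hypothetical entry point for the system under test

NEW_BEHAVIOR = [
    # (user input, substring the answer must contain)
    ("What's your refund window?", "30 days"),
]

@pytest.mark.parametrize("user_input,required", NEW_BEHAVIOR)
def test_new_behavior(user_input, required):
    # Written first; fails until the system change lands (steps 1-2).
    assert required.lower() in run_system(user_input).lower()
```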

Frameworks

  • promptfoo: config-driven evals, easy CI integration.
  • DeepEval: LLM eval framework with many built-in metrics.
  • Inspect (UK AISI): research-grade, great for RL/RLHF too.
  • Braintrust: SaaS eval + observability.
  • LangSmith / Langfuse / Phoenix / Helicone: tracing + eval combined.
  • RAGAs: RAG-specific.
  • lm-evaluation-harness: standard benchmarks.
  • mteb: embedding evals.

For most teams: pick one framework and stick with it.

Common eval failures

  • No held-out set: every change “improves” because you tune on the eval.
  • Stale eval: the system changed; your eval didn’t.
  • Wrong metric: optimizing accuracy when latency is what matters.
  • Single-number reporting: a 92% average hides per-class disasters.
  • No confidence intervals: noise misread as improvement.
  • Over-trust in LLM-judge: judges are biased; calibrate.
  • No production sampling: eval set looks great; prod is dying.

Eval economics

Budget honestly:

  • Each eval run: 100s–1000s of LLM calls.
  • A judging pass: another 100s–1000s.
  • For a thoughtful PR review: $1–$10 in eval cost.
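
A back-of-envelope check with placeholder numbers (case counts, token counts, and prices are assumptions; substitute your own):

```python
cases = 300              # eval set size
calls_per_case = 2       # one generation pass + one judging pass
tokens_per_call = 2_000  # prompt + completion combined (placeholder)
usd_per_m_tokens = 5.00  # blended price per million tokens (placeholder)

run_cost = cases * calls_per_case * tokens_per_call * usd_per_m_tokens / 1_000_000
print(f"~${run_cost:.2f} per full eval run")   # $6.00 with these numbers
```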

This is cheap compared to launching a regression. Don’t skip.

Watch it interactively

  • LLM-as-Judge — tune rubric weights, watch the winner change. Shows why the rubric, not the answers, often decides “which model is better.”
  • Confusion Matrix Lab — drag the threshold; watch precision, recall, F1, and AUC trade off live. The same dynamics power retrieval@k vs precision in RAG.
  • Calibration Lab — temperature scaling on real overconfident logits. Predict before clicking: drag T from 1.0 → 2.0 and watch ECE drop from ~12% to ~3%; push past 2.5 and ECE rises again from the other side. The U-shape is the lesson.

Build it in code

  • /ship/04 — build the eval harness — TaskCase, rule-based grading, LLM-as-judge grading, the basic bench script.
  • /ship/13 — evaluation in production — golden-set regression on every deploy, paired A/B prompt testing with a t-test, drift detection on live traffic, feedback-to-eval pipeline. The shape of an eval system that runs on schedule, not just on PRs.
  • /case-studies/01 — docs assistant — three-bucket refusal eval (out-of-scope / answerable / boundary) and the cite-correctness metric. Real eval framework on a real product.

See also