Evals at the speed of inference
LLM-as-judge: the pattern where one model scores another model's outputs. Tune the rubric weights and watch the winner change. The technique behind every fast eval loop in production.
The aggregation math
# for each candidate answer:
total_score = (Σᵢ wᵢ · sᵢ) / (Σᵢ wᵢ)
# wᵢ = your weight for criterion i
# sᵢ = the judge's 1–5 score on criterion i
# winner = argmax of total_score over candidates
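A minimal sketch of that aggregation in Python. It assumes each candidate's judge scores arrive as a dict of criterion → 1–5 score; the names `weighted_score` and `pick_winner` are illustrative, not taken from the demo's code.

```python
# Minimal sketch: weighted-mean aggregation over rubric criteria.
# Assumes judge scores are already collected as {criterion: score on a 1-5 scale}.

def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    """total_score = (sum of w_i * s_i) / (sum of w_i), matching the formula above."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return 0.0  # degenerate rubric: every criterion weighted zero
    return sum(weights[c] * scores[c] for c in weights) / total_weight

def pick_winner(candidates: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    """winner = argmax of the weighted score over candidate answers."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))

# Usage (illustrative scores):
# pick_winner({"A": {"factuality": 5, "helpfulness": 2},
#              "B": {"factuality": 3, "helpfulness": 5}},
#             weights={"factuality": 5, "helpfulness": 1})   # -> "A"
```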
Try this — predict before you click
- Pick the factual sample. Set "factuality" weight to 5, all others to 1. Predict: the answer with the highest factuality score wins, even if it's terse. Now flip it: set "factuality" to 0 and "helpfulness" to 5. Predict: the longer, friendlier-but-less-accurate answer suddenly wins. Same answers, different rubric, different winner (a worked version of this flip is sketched after this list).
- Pick the ambiguous sample. Try equal weights (1, 1, 1, 1). Predict: the winner is decided by tiny score differences (~0.1). With equal weights and close scores, the judge is essentially flipping a coin — this is why production rubrics hand-tune weights to express what they actually care about.
- Imagine a "verbosity bias" judge that secretly rewards longer answers. Predict: across many samples, the more verbose candidate wins more often than its content deserves. This is a real, measurable bias in GPT-4-as-judge benchmarks. Production systems normalize answer length, swap candidate positions to detect position bias, and run multiple judges to vote; a bias-check sketch follows this list.
- Compare the creative sample with weights all on "tone" vs. all on "factuality". Predict: factuality scores are similar (creative writing has few facts), but tone scores diverge sharply. Pick a rubric that matches your task — factuality dominance is wrong for creative tasks; tone dominance is wrong for factual tasks.
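To make the rubric-flip experiment concrete, here is a worked sketch with made-up judge scores for two hypothetical candidates; the numbers and names are illustrative, not the demo's actual samples. Moving weight from factuality to helpfulness flips the winner without touching a single score.

```python
# Illustrative only: two fixed score sheets, two rubrics, two different winners.
candidates = {
    "terse_accurate": {"factuality": 5, "helpfulness": 2, "tone": 3, "clarity": 4},
    "friendly_fuzzy": {"factuality": 3, "helpfulness": 5, "tone": 5, "clarity": 4},
}

def score(scores, weights):
    # Weighted mean: (sum of w_i * s_i) / (sum of w_i)
    return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())

rubric_factual = {"factuality": 5, "helpfulness": 1, "tone": 1, "clarity": 1}
rubric_helpful = {"factuality": 0, "helpfulness": 5, "tone": 1, "clarity": 1}

for label, rubric in [("factuality-heavy", rubric_factual),
                      ("helpfulness-heavy", rubric_helpful)]:
    winner = max(candidates, key=lambda name: score(candidates[name], rubric))
    print(label, "->", winner)
# factuality-heavy -> terse_accurate
# helpfulness-heavy -> friendly_fuzzy
```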
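And a sketch of two of the mitigations the verbosity-bias bullet names: swapping candidate positions to catch position bias, and letting a panel of judges vote. `call_judge` is a stand-in for whatever judge API and prompt you actually use; everything here is assumed structure, not the demo's code. Length normalization would happen upstream, before answers reach the judge.

```python
import random
from collections import Counter

def call_judge(judge_model: str, answer_a: str, answer_b: str) -> str:
    """Stand-in for a real judge call; returns "A" or "B".
    Replace with your actual LLM-as-judge prompt plus a parse step."""
    return random.choice(["A", "B"])  # placeholder

def judged_with_swap(judge_model: str, ans_x: str, ans_y: str) -> str | None:
    """Judge the pair in both orders; only trust verdicts that survive the swap."""
    first = call_judge(judge_model, ans_x, ans_y)    # x shown in position A
    second = call_judge(judge_model, ans_y, ans_x)   # x shown in position B
    if first == "A" and second == "B":
        return "x"
    if first == "B" and second == "A":
        return "y"
    return None  # verdict followed the position, not the answer: discard it

def panel_vote(judges: list[str], ans_x: str, ans_y: str) -> str | None:
    """Majority vote across judges, counting only position-consistent verdicts."""
    votes = Counter(v for j in judges if (v := judged_with_swap(j, ans_x, ans_y)))
    if not votes:
        return None
    return votes.most_common(1)[0][0]
```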
Anchored to 13-production/evaluation-and-benchmarks.
Code-side: /ship/13 — evaluation in production.