demo

Evals at the speed of inference

LLM-as-judge: the pattern where one model scores another model's outputs. Tune the rubric weights and watch the winner change. The technique behind every fast eval loop in production.

The aggregation math

# for each candidate answer:
total_score = (Σᵢ wᵢ · sᵢ) / (Σᵢ wᵢ)

# wᵢ = your weight for criterion i
# sᵢ = the judge's 1–5 score on criterion i
# winner = argmax over candidates
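
The same aggregation as a short Python sketch. It runs as written, but the criterion names, candidate names, and scores are made up for illustration; only the weighted mean and the argmax come from the formula above.

# minimal sketch of the aggregation, assuming scores and weights are
# plain dicts keyed by criterion name
def weighted_score(scores: dict[str, float], weights: dict[str, float]) -> float:
    # weighted mean of the judge's per-criterion scores (each on a 1-5 scale)
    total_weight = sum(weights.values())
    return sum(weights[c] * scores[c] for c in weights) / total_weight

def pick_winner(candidates: dict[str, dict[str, float]], weights: dict[str, float]) -> str:
    # argmax over candidates by weighted score
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))

# hypothetical judge output for two candidate answers
candidates = {
    "answer_a": {"factuality": 5, "helpfulness": 3, "tone": 3, "conciseness": 4},
    "answer_b": {"factuality": 3, "helpfulness": 5, "tone": 4, "conciseness": 2},
}
weights = {"factuality": 5, "helpfulness": 1, "tone": 1, "conciseness": 1}
print(pick_winner(candidates, weights))  # answer_a: factuality dominates the average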

Try this — predict before you click

  1. Pick the factual sample. Set "factuality" weight to 5, all others to 1. Predict: the answer with the highest factuality score wins, even if it's terse. Now flip — set "factuality" to 0 and "helpfulness" to 5. Predict: the longer, friendlier-but-less-accurate answer suddenly wins. Same answers, different rubric, different winner (see the rubric-flip sketch after this list).
  2. Pick the ambiguous sample. Try equal weights (1, 1, 1, 1). Predict: the winner is decided by tiny score differences (~0.1). With equal weights and close scores, the judge is essentially flipping a coin — this is why production rubrics hand-tune weights to express what they actually care about.
  3. Imagine a "verbosity bias" judge — it secretly rewards longer answers. Predict: across many samples, the more verbose candidate wins more often than its content deserves. This is a real, measurable bias in GPT-4-as-judge benchmarks. Production systems normalize answer length, swap candidate positions to detect position bias, and run multiple judges to vote (see the bias-mitigation sketch after this list).
  4. Compare the creative sample with weights all on "tone" vs. all on "factuality". Predict: factuality scores are similar (creative writing has few facts), but tone scores diverge sharply. Pick a rubric that matches your task — factuality dominance is wrong for creative tasks; tone dominance is wrong for factual tasks.
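
A sketch of the rubric flip from step 1. The two candidates and their per-criterion scores are invented stand-ins for the factual sample, not the demo's actual numbers; the point is that nothing about the answers changes, only the weights do.

# hypothetical per-criterion judge scores for the two answers in step 1
candidates = {
    "terse_accurate":  {"factuality": 5, "helpfulness": 3, "tone": 3, "conciseness": 5},
    "friendly_sloppy": {"factuality": 3, "helpfulness": 5, "tone": 5, "conciseness": 2},
}

def winner(weights: dict[str, float]) -> str:
    # weighted mean per candidate, then argmax (same math as the formula above)
    def score(scores):
        return sum(weights[c] * scores[c] for c in weights) / sum(weights.values())
    return max(candidates, key=lambda name: score(candidates[name]))

print(winner({"factuality": 5, "helpfulness": 1, "tone": 1, "conciseness": 1}))  # terse_accurate
print(winner({"factuality": 0, "helpfulness": 5, "tone": 1, "conciseness": 1}))  # friendly_sloppy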
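
And a sketch of the step-3 mitigations. judge(), judge_with_position_swap(), and panel_verdict() are hypothetical names; judge() is a placeholder for whatever call actually asks a judge model to compare two answers, and the position swap and majority vote are the patterns named above, not a specific library API.

import random
from collections import Counter

def judge(prompt: str, answer_a: str, answer_b: str, model: str) -> str:
    # placeholder: replace with a real LLM-as-judge call that returns "A" or "B"
    return random.choice(["A", "B"])

def judge_with_position_swap(prompt: str, ans_1: str, ans_2: str, model: str) -> str:
    # ask twice with the candidates swapped; only keep a verdict that survives
    # the swap, otherwise report a tie (a sign of position bias)
    first = judge(prompt, ans_1, ans_2, model)
    swapped = judge(prompt, ans_2, ans_1, model)
    # in the swapped call, "B" refers to ans_1 and "A" to ans_2
    swapped_as_original = "A" if swapped == "B" else "B"
    return first if first == swapped_as_original else "tie"

def panel_verdict(prompt: str, ans_1: str, ans_2: str, models: list[str]) -> str:
    # run several judge models and take the majority vote
    votes = Counter(judge_with_position_swap(prompt, ans_1, ans_2, m) for m in models)
    return votes.most_common(1)[0][0]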

Anchored to 13-production/evaluation-and-benchmarks. Code-side: /ship/13 — evaluation in production.