production-stack foundations · 04 / 17 · 22 min read · 30 min hands-on

step 04 · ship · foundations

Build an eval harness

lm-eval-harness for benchmarks + a custom task-specific eval you write yourself. The 'is it any good' question, answered programmatically.

evaluation

You have a model running. Now: is it any good?

That’s a deceptively hard question. There’s no one number — different evals measure different capabilities, all of them are partial, and most of the published benchmark scores you’ve seen don’t predict downstream task performance well. Production teams know this; they evaluate on two axes simultaneously: standardized academic benchmarks (for sanity-checking the base model) and task-specific evals (for what you actually ship).

By the end of this step you’ll have both running, and a single command that produces a report you can rerun after every prompt change, model swap, or fine-tune.

What this is and isn’t

Three categories of eval, in roughly increasing relevance to production:

  1. Static benchmarks (MMLU, HellaSwag, GSM8K, TruthfulQA): widely used for cross-model comparison. Easy to game; not always predictive of your task. Useful for “is this model fundamentally broken or competitive?”
  2. Task-specific evals: you write them. Test the actual behaviors your application depends on. These are the ones that matter.
  3. Production evals: run continuously against live traffic. Step 13 covers them; out of scope here.

This step covers (1) and (2). We’ll wire (3) into the foundation we lay here.

Install lm-evaluation-harness

EleutherAI’s lm-evaluation-harness is the de facto standard for academic benchmarks. It implements ~150 tasks and can run them against any HuggingFace-compatible model or OpenAI-compatible endpoint.

uv add lm-eval

Verify the install:

uv run lm_eval --tasks list 2>&1 | head -30

You should see a long list of task names (“mmlu”, “hellaswag”, “gsm8k”, “truthfulqa_mc1”, “arc_challenge”, and ~150 more). If the install worked, those are all callable.
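
The CLI is all the rest of this step uses, but the harness also exposes a Python entry point, simple_evaluate, which is handy if you want evals inside your own test suite. A minimal sketch, pointing at the same vLLM endpoint used below; keyword names have shifted between harness releases, so treat it as a shape to adapt, and note that limit=20 makes it a smoke test, not a reportable score:

# smoke_eval.py (hypothetical filename): drive lm-eval from Python
import lm_eval

results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args=(
        "model=meta-llama/Llama-3.1-8B-Instruct,"
        "base_url=http://localhost:8000/v1/chat/completions,"
        "num_concurrent=8"
    ),
    tasks=["gsm8k"],
    num_fewshot=8,
    limit=20,  # only 20 questions: quick signal, not a benchmark number
)
print(results["results"]["gsm8k"])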

Run MMLU on your model

MMLU (Massive Multitask Language Understanding) tests knowledge across 57 academic subjects with multiple-choice questions. The prototypical “model knowledge” benchmark.

lm-eval supports OpenAI-compatible endpoints via the local-chat-completions model type. Point it at your vLLM (or Ollama) instance:

uv run lm_eval \
  --model local-chat-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=8 \
  --tasks mmlu \
  --num_fewshot 5 \
  --output_path eval-results/mmlu.json

Quick translation:

  • --model local-chat-completions: tells lm-eval we’re hitting an OpenAI-compat endpoint, not loading weights
  • --model_args: where to find it; num_concurrent=8 parallelizes requests against vLLM (continuous batching pays off here)
  • --tasks mmlu: run the full MMLU benchmark (14k questions across 57 subjects)
  • --num_fewshot 5: give the model 5 examples in context before each question (the standard MMLU configuration)

Run time: ~20–40 minutes against vLLM on a single GPU at num_concurrent=8. Considerably longer against Ollama (it doesn’t batch well; see step 03).

Expected output (final report):

|      Tasks       |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|------------------|------:|------|-----:|------|---|-----:|---|-----:|
|mmlu              |      2|none  |     5|acc   |↑  |0.6824|±  |0.0036|
| - humanities     |      2|none  |     5|acc   |↑  |0.6298|±  |0.0066|
| - social sciences|      2|none  |     5|acc   |↑  |0.7754|±  |0.0073|
| - stem           |      2|none  |     5|acc   |↑  |0.5926|±  |0.0084|
| - other          |      2|none  |     5|acc   |↑  |0.7417|±  |0.0075|

0.6824 on MMLU is roughly what Llama-3.1-8B-Instruct should score. If you got below 0.5, something’s wrong (model misconfigured, wrong system prompt, broken fewshot). If you got above 0.75, double-check — that’s GPT-4 territory.

Run a faster sanity check

MMLU is comprehensive but slow. For iteration during development, run smaller benchmarks:

# HellaSwag — 10K commonsense reasoning questions, ~5 minutes against vLLM
uv run lm_eval \
  --model local-chat-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=8 \
  --tasks hellaswag \
  --num_fewshot 0 \
  --output_path eval-results/hellaswag.json

# GSM8K — 1.3K grade-school math word problems, ~10 minutes
uv run lm_eval \
  --model local-chat-completions \
  --model_args model=meta-llama/Llama-3.1-8B-Instruct,base_url=http://localhost:8000/v1/chat/completions,num_concurrent=8 \
  --tasks gsm8k \
  --num_fewshot 8 \
  --output_path eval-results/gsm8k.json

Expected scores for Llama-3.1-8B-Instruct (rough):

| Benchmark            | What it measures                     | Llama-3.1-8B |
|----------------------|--------------------------------------|-------------:|
| MMLU (5-shot)        | broad knowledge                      |        ~0.68 |
| HellaSwag (0-shot)   | commonsense reasoning                |        ~0.78 |
| GSM8K (8-shot, CoT)  | math word problems                   |        ~0.55 |
| ARC-Challenge        | science questions                    |        ~0.55 |
| TruthfulQA-MC1       | resistance to common misconceptions  |        ~0.45 |

These are sanity bounds. A score wildly above them usually means benchmark contamination or a scoring bug; a score wildly below usually means a configuration issue.
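
Once a few runs have accumulated under eval-results/, a small script that pulls out the headline numbers makes regressions easy to spot after a model swap or prompt change. A sketch with two assumptions flagged in the comments: depending on the harness version, --output_path is written either as the exact file you named or as a timestamped results_*.json inside a directory, and metric key names (e.g. "acc,none") vary slightly between releases.

# summarize_runs.py (hypothetical helper): skim headline metrics from lm-eval output
import json
from pathlib import Path

# rglob catches both layouts: the file you named and nested results_*.json files
for path in sorted(Path("eval-results").rglob("*.json")):
    data = json.loads(path.read_text())
    results = data.get("results", {})
    if not isinstance(results, dict):  # skip files with a different layout
        continue
    for task, metrics in results.items():
        # keep numeric metrics (accuracy, exact match, ...), drop stderr entries
        nums = {k: round(v, 4) for k, v in metrics.items()
                if isinstance(v, (int, float)) and "stderr" not in k}
        print(f"{path.name:28s} {task:22s} {nums}")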

The eval that actually matters

Now the part academic benchmarks don’t cover: the task you ship. Suppose your application is a TL;DR summarizer for technical articles. MMLU doesn’t tell you whether your model can do that. You have to write a custom eval.

The pattern: a small set of curated (input, expected, judging_criteria) triples and a script that grades the model’s output. Two grading styles:

  • Rule-based (regex, length checks, schema validation): fast, deterministic, works for structured outputs
  • LLM-as-judge (a stronger model rates the output against a rubric): expensive, noisy, but scales to subjective qualities

Production evals use both. Let me show you the shape.

Write a custom task eval

A sketch — stack/eval.py. We’ll target a TL;DR summarization task.

# stack/eval.py
from __future__ import annotations
import json
import re
import statistics
from dataclasses import dataclass
from pathlib import Path
import httpx


# Point at whichever backend is running. Ollama on 11434, vLLM on 8000.
LLM_URL = "http://localhost:8000/v1/chat/completions"
LLM_MODEL = "meta-llama/Llama-3.1-8B-Instruct"

# (Optional) a stronger judge for LLM-as-judge scoring. Use the same
# local model for consistency, or point at OpenAI/Anthropic for higher
# fidelity. Cost matters; the local-judge approach is cheap and good
# enough for development iteration.
JUDGE_URL = LLM_URL
JUDGE_MODEL = LLM_MODEL


@dataclass
class TaskCase:
    """One test case for the summarization task."""
    id: str
    input: str
    must_mention: list[str]      # rule-based: words/phrases the summary must contain
    must_not_mention: list[str]  # rule-based: words it must NOT contain
    rubric: str                  # human-readable description for LLM judge


CASES: list[TaskCase] = [
    TaskCase(
        id="solar-system",
        input=(
            "The solar system formed approximately 4.6 billion years ago "
            "from a giant interstellar molecular cloud. The Sun, which "
            "contains 99.86% of the system's mass, is orbited by eight "
            "planets, several dwarf planets including Pluto, and countless "
            "asteroids and comets. The four inner planets are rocky; the "
            "four outer planets are gaseous or icy."
        ),
        must_mention=["4.6 billion", "Sun", "planets"],
        must_not_mention=["controversial", "scientists disagree"],
        rubric="A two-sentence summary covering the age, the Sun's mass, and the planet structure.",
    ),
    TaskCase(
        id="kv-cache",
        input=(
            "KV caching is a technique used in transformer inference to "
            "avoid recomputing the keys and values of previously seen "
            "tokens. During autoregressive generation, only the new "
            "token's K and V need to be computed; the cached entries "
            "for earlier tokens are reused. This drops per-token cost "
            "from O(n²) to O(n) where n is sequence length."
        ),
        must_mention=["KV", "cache"],
        must_not_mention=["GPU memory", "Apple Silicon"],
        rubric="One sentence explaining what KV caching does and why it speeds up inference.",
    ),
    # Add 10–50 more for a real eval set.
]


# ─── grading ────────────────────────────────────────────────────────


def grade_rules(case: TaskCase, output: str) -> tuple[float, list[str]]:
    """Rule-based grading. Returns (score in [0, 1], list of issues)."""
    issues = []
    score = 1.0

    # Required mentions
    missing = [m for m in case.must_mention if m.lower() not in output.lower()]
    if missing:
        issues.append(f"missing required: {missing}")
        score -= 0.5 * (len(missing) / max(1, len(case.must_mention)))

    # Forbidden mentions
    present_forbidden = [m for m in case.must_not_mention if m.lower() in output.lower()]
    if present_forbidden:
        issues.append(f"contains forbidden: {present_forbidden}")
        score -= 0.5 * (len(present_forbidden) / max(1, len(case.must_not_mention)))

    # Length sanity (TL;DRs shouldn't be longer than the input)
    if len(output) > len(case.input):
        issues.append("summary longer than input")
        score -= 0.2

    return max(0.0, score), issues


JUDGE_PROMPT = """You are an evaluator. Read the expected behavior and the actual output, then score from 1 to 5.

Expected behavior: {rubric}

Actual output:
\"\"\"
{output}
\"\"\"

Respond with ONLY a single integer from 1 to 5. No explanation."""


def grade_judge(case: TaskCase, output: str) -> int:
    """LLM-as-judge scoring. Returns int in [1, 5]."""
    prompt = JUDGE_PROMPT.format(rubric=case.rubric, output=output)
    with httpx.Client(timeout=60.0) as client:
        r = client.post(
            JUDGE_URL,
            json={
                "model": JUDGE_MODEL,
                "messages": [{"role": "user", "content": prompt}],
                "temperature": 0,  # determinism for judging
                "max_tokens": 5,
            },
        )
        r.raise_for_status()
        text = r.json()["choices"][0]["message"]["content"].strip()
    m = re.search(r"[1-5]", text)
    return int(m.group()) if m else 0


# ─── runner ─────────────────────────────────────────────────────────


SYSTEM_PROMPT = "You write concise TL;DR summaries. Be brief, accurate, and factual. Two sentences maximum."


def run_one(case: TaskCase) -> dict:
    with httpx.Client(timeout=60.0) as client:
        r = client.post(
            LLM_URL,
            json={
                "model": LLM_MODEL,
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": f"Summarize:\n\n{case.input}"},
                ],
                "temperature": 0,
                "max_tokens": 150,
            },
        )
        r.raise_for_status()
        output = r.json()["choices"][0]["message"]["content"].strip()

    rule_score, rule_issues = grade_rules(case, output)
    judge_score = grade_judge(case, output)
    return {
        "id": case.id,
        "output": output,
        "rule_score": rule_score,
        "rule_issues": rule_issues,
        "judge_score": judge_score,
    }


def run_all() -> dict:
    results = [run_one(case) for case in CASES]

    rule_scores = [r["rule_score"] for r in results]
    judge_scores = [r["judge_score"] for r in results]

    summary = {
        "n_cases": len(results),
        "rule_avg": statistics.mean(rule_scores),
        "rule_pass_rate": sum(1 for s in rule_scores if s >= 0.8) / len(rule_scores),
        "judge_avg": statistics.mean(judge_scores),
        "judge_pass_rate": sum(1 for s in judge_scores if s >= 4) / len(judge_scores),
    }
    return {"summary": summary, "results": results}


if __name__ == "__main__":
    Path("eval-results").mkdir(exist_ok=True)
    report = run_all()

    print(f"\n── eval summary ──")
    s = report["summary"]
    print(f"  cases:           {s['n_cases']}")
    print(f"  rule avg:        {s['rule_avg']:.2f} / 1.00")
    print(f"  rule pass rate:  {s['rule_pass_rate']:.0%}")
    print(f"  judge avg:       {s['judge_avg']:.2f} / 5.00")
    print(f"  judge pass rate: {s['judge_pass_rate']:.0%}")

    print(f"\n── per-case ──")
    for r in report["results"]:
        flag = "✓" if r["rule_score"] >= 0.8 and r["judge_score"] >= 4 else "✗"
        print(f"  {flag} {r['id']:20s}  rule={r['rule_score']:.2f}  judge={r['judge_score']}")
        if r["rule_issues"]:
            print(f"     ↳ {r['rule_issues']}")

    Path("eval-results/custom.json").write_text(json.dumps(report, indent=2))
    print(f"\nfull report → eval-results/custom.json")

Run it

uv run python -m stack.eval

Expected output (your numbers will differ by model and prompt; with temperature=0 a rerun should reproduce them):

── eval summary ──
  cases:           2
  rule avg:        0.90 / 1.00
  rule pass rate:  100%
  judge avg:       4.50 / 5.00
  judge pass rate: 100%

── per-case ──
  ✓ solar-system          rule=0.80  judge=4
     ↳ ['summary longer than input']
  ✓ kv-cache              rule=1.00  judge=5

full report → eval-results/custom.json

Two cases are far too few to be statistically meaningful. The point is the shape: build the eval against your real cases, expand to 50–500, and rerun it after every change.
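
One way to grow past hardcoded cases: keep them in a data file. A sketch that slots into stack/eval.py, reusing the TaskCase dataclass from above and assuming a hypothetical stack/cases.jsonl with one JSON object per line whose keys match TaskCase's fields:

# addition to stack/eval.py: load cases from a data file instead of code.
# stack/cases.jsonl is a hypothetical file; each line is a JSON object with
# the TaskCase fields (id, input, must_mention, must_not_mention, rubric).
def load_cases(path: str = "stack/cases.jsonl") -> list[TaskCase]:
    cases = []
    for line in Path(path).read_text().splitlines():
        if not line.strip():
            continue
        cases.append(TaskCase(**json.loads(line)))
    return cases

# then replace the hardcoded list:  CASES = load_cases()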

Common pitfalls

Eval set leakage. If your training data overlaps with your eval cases, scores look great and predict nothing. Hold out the eval set from the start; don’t iterate on it.

Single-temperature evals. Sampling at temperature=0 (greedy) gives one number; production runs at higher temperatures. Run your eval at temperature=0 for reproducibility and at your production temperature to see real-world variance.
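
A hypothetical helper along those lines, slotting into stack/eval.py next to run_one(): it repeats the same request at a production-like temperature and reports how much the rule score moves between samples (temperature=0.7 is a placeholder for whatever your production config uses).

# addition to stack/eval.py: measure rule-score spread under sampling.
def score_variance(case: TaskCase, temperature: float = 0.7, n: int = 5) -> dict:
    scores = []
    for _ in range(n):
        with httpx.Client(timeout=60.0) as client:
            r = client.post(
                LLM_URL,
                json={
                    "model": LLM_MODEL,
                    "messages": [
                        {"role": "system", "content": SYSTEM_PROMPT},
                        {"role": "user", "content": f"Summarize:\n\n{case.input}"},
                    ],
                    "temperature": temperature,  # production-like sampling
                    "max_tokens": 150,
                },
            )
            r.raise_for_status()
            output = r.json()["choices"][0]["message"]["content"].strip()
        scores.append(grade_rules(case, output)[0])
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if n > 1 else 0.0,
    }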

Trusting one judge. LLM-as-judge has biases (preferring longer outputs, agreeing with itself, favoring its own model family). For high-stakes decisions: median of three different judges, or include a human spot-check.
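
For a concrete shape, a sketch of median-of-judges scoring that slots into stack/eval.py: judge_endpoints is a hypothetical list of (base_url, model) pairs for OpenAI-compatible judges, each scored with the same JUDGE_PROMPT used by grade_judge above.

# addition to stack/eval.py: score with several judges and take the median.
def grade_median_of_judges(
    case: TaskCase,
    output: str,
    judge_endpoints: list[tuple[str, str]],
) -> float:
    prompt = JUDGE_PROMPT.format(rubric=case.rubric, output=output)
    scores = []
    for url, model in judge_endpoints:
        with httpx.Client(timeout=60.0) as client:
            r = client.post(
                url,
                json={
                    "model": model,
                    "messages": [{"role": "user", "content": prompt}],
                    "temperature": 0,
                    "max_tokens": 5,
                },
            )
            r.raise_for_status()
            text = r.json()["choices"][0]["message"]["content"].strip()
        m = re.search(r"[1-5]", text)
        scores.append(int(m.group()) if m else 0)
    return statistics.median(scores)  # the median damps a single biased judge

# e.g. grade_median_of_judges(case, output, [(JUDGE_URL, JUDGE_MODEL), ...])
# where the remaining entries would be second and third judge endpoints.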

Too few cases. Two examples is a test, not an eval. Aim for 30+ to start, 100+ for production-relevant signal.

Cross-references

  • LLM-as-Judge demo — interactive rubric scoring, makes the judging mechanic concrete with weighted criteria
  • Perplexity Calculator demo — what perplexity is and why we don’t use it for chat-style eval (it isn’t quality, it’s predictability)
  • Calibration Lab demo — the “is the model confident when it’s right” property; matters once you wire confidence-aware retrieval
  • /build step 13 (Evaluate honestly) — the same three-lens pattern (perplexity / generation / LLM-as-judge), one level deeper

What we did and didn’t do

What we did:

  • Installed and ran lm-eval-harness against MMLU, HellaSwag, GSM8K
  • Wrote a custom task eval combining rule-based and LLM-as-judge grading
  • Stood up a deterministic, reproducible eval pipeline you can rerun in one command
  • Used the model itself as the judge for cheap iteration (and noted when you’d want a stronger external judge)

What we didn’t:

  • Wire evals to CI. Production teams gate model upgrades behind passing the eval suite. We’ll touch on this in step 13 (eval in production) and step 15 (deployment).
  • Implement pairwise comparison (Arena-style). “Which output is better, A or B?” judging is the gold standard for subjective quality. Adds bookkeeping; orthogonal to the basics here.
  • Statistical significance testing. With 50+ cases you’d want bootstrap confidence intervals on the judge scores. Production-tier; out of scope.
  • Prompt-injection-style eval. Adversarial inputs that try to break your system prompt. Important for safety-critical apps; covered in step 12 alongside observability.

Next

Step 05 wraps everything we have so far in a FastAPI service. The Ollama / vLLM client, eval helpers, and a streaming /v1/chat/completions endpoint of your own — with auth, structured logging, and a health check. After step 05 you’ll have a self-hosted LLM-as-a-service you could expose behind a real URL.