
step 13 · ship · production

Evaluation in production

A/B testing prompts, drift detection, golden-set regression. How to know your model got worse before users do.


The eval harness from step 04 ran against a fixed test set, on demand. That’s the offline mode — useful before you ship, useless after. Once your service is live, three new things happen:

  1. Real users ask questions you didn’t anticipate. Your test set has 50 questions; users have asked 50,000 questions in the first week, and the long tail looks nothing like your test set.
  2. Inputs drift. The questions users ask in week 8 are different from week 1. Your eval scores at deploy time say nothing about how the model is doing on this week’s traffic.
  3. Things change underneath you. A vLLM upgrade, a model swap, a prompt template tweak in someone else’s PR — any of these can quietly degrade quality. Without prod evals, you find out from a Twitter complaint.

Production evals close the loop. The eval harness from step 04 is half the answer; this step is the other half.

Four things you actually need

Most “production AI eval” articles list 30 things and you do none of them. Here are the four that matter, ranked by ROI:

  1. Golden-set regression on every deploy. A small (50–500), curated set of (input, expected, grader) cases that runs as part of CI. If quality drops on this set, the deploy fails.
  2. A/B prompt comparison with statistical rigor. When you change a prompt, you want a quick answer to “is the new one actually better?” Same task set, two prompts, paired comparison, p-value.
  3. Drift detection on live traffic. Score a sample of live requests with an LLM-as-judge against a quality rubric. Alert if the rolling mean drops.
  4. Thumbs-down to eval pipeline. When a user gives a bad rating, capture the trace and add it (with a reviewed expected output) to your golden set. Your test set grows automatically with the failure modes you actually see.

We’ll build all four. None individually is more than ~50 lines.

Setup

The eval pipeline reuses the harness from step 04 — TaskCase, grade_rules, grade_judge. We’ll add one new dependency for statistics:

uv add scipy

Open the new file:

# stack/eval_prod.py
from __future__ import annotations
import json
import statistics
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable

import scipy.stats

from stack.eval import TaskCase, grade_rules, grade_judge
from stack.llm import LLM

Golden-set regression

@dataclass
class RegressionResult:
    passed: int
    failed: int
    cases: list[dict] = field(default_factory=list)
    pass_rate: float = 0.0

    def is_passing(self, threshold: float = 0.9) -> bool:
        return self.pass_rate >= threshold


def run_regression(
    cases: list[TaskCase],
    runner: Callable[[str], str],
    grader: Callable[[TaskCase, str], tuple[bool, str]] = grade_rules,
) -> RegressionResult:
    """Run a fixed set of cases and report pass/fail."""
    passed = 0
    failed = 0
    out_cases = []
    for case in cases:
        actual = runner(case.input)
        ok, reason = grader(case, actual)
        out_cases.append({
            "id": case.id, "ok": ok,
            "input": case.input[:200],
            "actual": actual[:300],
            "reason": reason,
        })
        if ok:
            passed += 1
        else:
            failed += 1
    total = passed + failed
    return RegressionResult(
        passed=passed, failed=failed,
        cases=out_cases,
        pass_rate=passed / total if total else 0.0,
    )

Wire it to your live service in CI:

# scripts/regression_test.py
import os
import sys
import httpx
from stack.eval_prod import run_regression
from stack.eval import load_cases


def runner_via_api(text: str) -> str:
    """Hit the live service exactly as a real user would."""
    r = httpx.post(
        "http://localhost:8000/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={"messages": [{"role": "user", "content": text}]},
        timeout=60,
    )
    r.raise_for_status()
    return r.json()["choices"][0]["message"]["content"]


if __name__ == "__main__":
    cases = load_cases("evals/golden.jsonl")
    result = run_regression(cases, runner_via_api)
    print(f"pass rate: {result.pass_rate:.1%} ({result.passed}/{result.passed + result.failed})")
    for c in result.cases:
        if not c["ok"]:
            print(f"  FAIL [{c['id']}]: {c['reason']}")
    sys.exit(0 if result.is_passing(0.9) else 1)
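
For reference, golden.jsonl holds one case per line. The exact schema comes from step 04’s TaskCase; assuming the fields used elsewhere in this step (id, input, expected, grader), a couple of entries might look like this:

```python
import json

# Hypothetical golden-set entries — the field names mirror how TaskCase
# is used in this step (id, input, expected, grader); adjust them to
# your actual TaskCase definition from step 04.
golden_cases = [
    {"id": "refund-policy-01",
     "input": "What is your refund window?",
     "expected": "30 days",
     "grader": "rules"},
    {"id": "sla-summary-01",
     "input": "Summarize the enterprise SLA.",
     "expected": "99.9% uptime, 4-hour P1 response",
     "grader": "judge"},
]

# One JSON object per line — the JSONL shape load_cases reads.
jsonl = "\n".join(json.dumps(c) for c in golden_cases)
print(jsonl)
```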

In your CI config:

# .github/workflows/deploy.yml
- name: Boot service
  run: docker compose up -d
- name: Regression
  run: uv run python scripts/regression_test.py

The exit code gates the pipeline: a pass rate below 90% fails the step, which blocks the deploy. You catch the obvious regressions before users do.

A/B prompt testing

The hardest thing about A/B prompt comparison isn’t the code; it’s having the discipline to do it before shipping every prompt change. Most teams eyeball a few examples, declare the new prompt better, and ship. Then quality drops 5% on production traffic. Build the harness, run it, look at the p-value.

# stack/eval_prod.py (continued)
@dataclass
class ABResult:
    prompt_a_mean: float
    prompt_b_mean: float
    delta: float                # B - A
    p_value: float
    significant: bool           # True if p < 0.05 and |delta| > min_effect
    n_cases: int


def ab_test_prompts(
    cases: list[TaskCase],
    runner_a: Callable[[str], str],
    runner_b: Callable[[str], str],
    judge_llm: LLM,
    min_effect: float = 0.05,
) -> ABResult:
    """Paired A/B test of two prompt variants on the same cases.

    `runner_a` and `runner_b` are functions that wrap the two prompts
    and return the model's answer. We grade with an LLM-as-judge for
    open-ended tasks; swap to grade_rules for closed ones.
    """
    scores_a, scores_b = [], []
    for case in cases:
        ans_a = runner_a(case.input)
        ans_b = runner_b(case.input)

        # Same judge, same rubric: each answer scored 1–5 independently.
        score_a, _ = grade_judge(judge_llm, case, ans_a)
        score_b, _ = grade_judge(judge_llm, case, ans_b)
        scores_a.append(score_a)
        scores_b.append(score_b)

    mean_a = statistics.mean(scores_a)
    mean_b = statistics.mean(scores_b)
    delta = mean_b - mean_a

    # Paired t-test: same cases, two treatments.
    _, p_value = scipy.stats.ttest_rel(scores_b, scores_a)

    return ABResult(
        prompt_a_mean=mean_a, prompt_b_mean=mean_b,
        delta=delta, p_value=p_value,
        significant=(p_value < 0.05 and abs(delta) >= min_effect),
        n_cases=len(cases),
    )

Run it:

ab = ab_test_prompts(
    cases=load_cases("evals/golden.jsonl"),
    runner_a=lambda q: chat_with_prompt(PROMPT_V1, q),
    runner_b=lambda q: chat_with_prompt(PROMPT_V2, q),
    judge_llm=LLM(),
)
print(f"Prompt A mean: {ab.prompt_a_mean:.2f}")
print(f"Prompt B mean: {ab.prompt_b_mean:.2f}")
print(f"Delta:         {ab.delta:+.2f}  (p={ab.p_value:.3f})")
print("VERDICT:", "B wins" if ab.significant and ab.delta > 0 else
                  "A wins" if ab.significant and ab.delta < 0 else
                  "no significant difference")

Two principles to internalize:

  • Paired test, not unpaired. The same case sees both prompts, so variance from “this case is hard for any prompt” cancels out. You’ll often reach significance with a fraction of the sample an unpaired test would need.
  • Effect size and p-value. A 0.01 improvement at p=0.001 is statistically real but practically nothing. Set a min_effect floor sized to your score scale (e.g. 0.05 points on a 1–5 judge scale) and reject changes below it even if they’re “significant.”
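
The power gain from pairing is easy to demonstrate with synthetic scores. Below, per-case difficulty dominates the variance — the usual shape of eval data — and the paired test recovers a 0.3-point effect that the unpaired test drowns in case-to-case noise. Everything here is simulated; the numbers are illustrative, not from the harness.

```python
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
n = 30

# Per-case difficulty swamps the per-prompt effect: some cases are
# just hard for any prompt.
difficulty = rng.normal(0.0, 1.0, n)
scores_a = difficulty + rng.normal(0.0, 0.1, n)          # prompt A
scores_b = difficulty + 0.3 + rng.normal(0.0, 0.1, n)    # prompt B: +0.3 better

_, p_paired = scipy.stats.ttest_rel(scores_b, scores_a)    # difficulty cancels
_, p_unpaired = scipy.stats.ttest_ind(scores_b, scores_a)  # difficulty stays as noise

print(f"paired p={p_paired:.2g}, unpaired p={p_unpaired:.2g}")
```

The paired p-value is orders of magnitude smaller on the exact same data — that is the whole argument for ttest_rel over ttest_ind here.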

Drift detection

Drift detection is the live-traffic version of regression. Instead of a fixed test set, you sample N requests per day, judge them, and watch the rolling mean.

# stack/eval_prod.py (continued)
DRIFT_RUBRIC = """\
You are a quality judge. Score this assistant response on a 1–5 scale:

5 = Excellent. Directly answers the user, accurate, well-formatted.
4 = Good. Answers the user with minor flaws.
3 = Acceptable. Mostly addresses the question, some issues.
2 = Poor. Misses key parts of the question or has notable errors.
1 = Bad. Wrong, off-topic, or unhelpful.

Output ONLY a JSON object: {"score": <int>, "reason": "<short reason>"}
"""


@dataclass
class TraceSample:
    """One captured trace with its query and response."""
    trace_id: str
    timestamp: float
    user_query: str
    assistant_response: str


def judge_sample(judge_llm: LLM, sample: TraceSample) -> tuple[int, str]:
    """LLM-as-judge over a single live trace sample. Returns (score, reason)."""
    response = judge_llm.chat(
        messages=[
            {"role": "system", "content": DRIFT_RUBRIC},
            {"role": "user", "content":
                f"USER ASKED: {sample.user_query}\n\n"
                f"ASSISTANT REPLIED: {sample.assistant_response}"},
        ],
        temperature=0.0,
    )
    text = response["choices"][0]["message"]["content"] or "{}"
    try:
        obj = json.loads(text)
        return int(obj["score"]), obj.get("reason", "")
    except Exception:
        return 0, "judge parse error"
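
judge_sample assumes the judge emits bare JSON. In practice judges often wrap the object in markdown fences or add prose despite the “ONLY a JSON object” instruction. A slightly more forgiving parser costs a few lines — this is a sketch (parse_judge_json is not part of the harness above); adapt it to whatever your judge actually emits:

```python
import json
import re


def parse_judge_json(text: str) -> dict:
    """Extract the first-to-last-brace JSON object from judge output.

    Tolerates markdown fences and surrounding prose; returns {} when
    no parseable object is found.
    """
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if not match:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```

Swapping this into judge_sample in place of the bare json.loads turns most “judge parse error” zeros back into real scores.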


@dataclass
class DriftReport:
    window_start: float
    window_end: float
    n_samples: int
    mean_score: float
    p10_score: float        # 10th percentile — surfaces tail regressions
    bad_share: float        # fraction with score <= 2
    delta_vs_baseline: float | None = None
    alert: bool = False


def compute_drift(
    samples: list[TraceSample],
    judge_llm: LLM,
    baseline_mean: float | None = None,
    alert_threshold: float = 0.2,
) -> DriftReport:
    """Score N live samples and produce a report.

    Alert if the mean drops by more than `alert_threshold` vs baseline.
    """
    scores = [judge_sample(judge_llm, s)[0] for s in samples]
    scores = [s for s in scores if s > 0]   # drop parse errors

    mean_score = statistics.mean(scores) if scores else 0.0
    if len(scores) >= 10:
        p10 = statistics.quantiles(scores, n=10)[0]
    else:
        p10 = min(scores) if scores else 0.0   # guard: window may be empty
    bad_share = sum(1 for s in scores if s <= 2) / len(scores) if scores else 0.0

    delta = None
    alert = False
    if baseline_mean is not None:
        delta = mean_score - baseline_mean
        alert = delta < -alert_threshold

    return DriftReport(
        window_start=min(s.timestamp for s in samples),
        window_end=max(s.timestamp for s in samples),
        n_samples=len(scores), mean_score=mean_score,
        p10_score=p10, bad_share=bad_share,
        delta_vs_baseline=delta, alert=alert,
    )

load_traces_from_phoenix() would query the Phoenix API for traces in a time window; we’ll skip the implementation here (5 lines of httpx).

Schedule it:

# scripts/daily_drift.py
import datetime as dt
import json

from stack.eval_prod import compute_drift, load_traces_from_phoenix
from stack.llm import LLM


if __name__ == "__main__":
    samples = load_traces_from_phoenix(
        since=dt.datetime.now(dt.timezone.utc) - dt.timedelta(days=1),
        sample_size=200,
    )
    report = compute_drift(samples, LLM(), baseline_mean=4.1)
    print(json.dumps({
        "n": report.n_samples,
        "mean": report.mean_score,
        "p10": report.p10_score,
        "bad_share": report.bad_share,
        "delta": report.delta_vs_baseline,
        "alert": report.alert,
    }, indent=2))
    if report.alert:
        # Slack-webhook, PagerDuty, whatever you use.
        notify_team(f"Quality drift: mean dropped to {report.mean_score:.2f}")

Cron at 2 a.m., 200 samples a day, baseline anchored to your golden-set score at deploy time. Quality drops 0.2 points and you get paged before users complain.
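
A fixed baseline_mean goes stale as traffic shifts. One common refinement — a sketch, assuming you persist each day’s mean score somewhere (rolling_baseline is not part of the harness above) — is to anchor the baseline to a trailing window instead of the deploy-time score:

```python
import statistics


def rolling_baseline(daily_means: list[float], window: int = 7) -> float:
    """Baseline = mean of the trailing `window` days, excluding today.

    Falls back to today's value when there is no history yet.
    """
    history = daily_means[:-1][-window:]
    return statistics.mean(history) if history else daily_means[-1]


# Seven stable days, then today's dip:
means = [4.1, 4.2, 4.0, 4.1, 4.2, 4.1, 4.0, 3.7]
baseline = rolling_baseline(means)
alert = (means[-1] - baseline) < -0.2   # same 0.2-point threshold as compute_drift
```

With a trailing window, a slow month-long decline still trips the alert eventually, and a one-off noisy day doesn’t poison the baseline forever.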

Feedback to eval pipeline

The last piece. When users thumb-down a response, capture it as a candidate eval case:

# stack/server.py — modified to capture feedback
@app.post("/v1/feedback")
async def feedback(
    payload: FeedbackPayload,
    api_key: str = Depends(verify_key),
) -> dict:
    """Capture user feedback. Routes thumbs-down to the eval candidate queue."""
    record = {
        "trace_id": payload.trace_id,
        "rating": payload.rating,
        "comment": payload.comment,
        "ts": int(time.time() * 1000),
    }
    # Append to local JSONL for now; in prod this is a queue / DB.
    Path("evals/feedback.jsonl").open("a").write(json.dumps(record) + "\n")
    return {"ok": True}

A short script promotes thumbs-down into golden-set candidates:

# scripts/feedback_to_eval.py
import json
from pathlib import Path
from stack.eval_prod import load_trace
from stack.eval import TaskCase


for line in Path("evals/feedback.jsonl").open():
    record = json.loads(line)
    if record["rating"] != "down":
        continue
    trace = load_trace(record["trace_id"])
    case = TaskCase(
        id=f"feedback-{record['trace_id'][:8]}",
        input=trace.user_query,
        expected="<TODO: human review needed>",
        grader="rules",   # or "judge"
    )
    Path("evals/candidates.jsonl").open("a").write(case.to_json() + "\n")

A human reviews candidates.jsonl weekly, fills in expected outputs, promotes the cases into golden.jsonl. Now your test suite grows with your real failure modes — not the failure modes you imagined at deploy time.


What we did and didn’t do

What we did:

  • A regression runner that gates deploys on a 90% pass rate
  • A paired A/B harness with a t-test and an effect-size threshold
  • A drift detector that scores live samples and alerts on rolling-mean drops
  • A feedback collector that promotes thumbs-down into eval candidates

What we didn’t:

  • Continuous online learning. Closing the loop by fine-tuning on bad cases. Powerful and risky; defer until you have rock-solid evals (and step 16 covers when this is worth it).
  • Multi-judge ensembling. Average over GPT-4o, Claude, and a local model to reduce single-judge bias. ~3× cost; worth it only when single-judge scores stop tracking human ratings.
  • Stratified sampling. Sample drift detection by user segment, language, or topic. Important when your traffic is heterogeneous; one extra GROUP BY in the trace query.
  • Synthetic data generation for eval. Have a model generate test cases. Useful for getting started; less useful once you have real user data.

Next

Step 14 is cost and latency tuning — caches, batching, KV reuse, quantization, and the four levers behind every serving optimization. The eval pipeline you just built tells you whether quality stays put while you make the system faster and cheaper. That’s not a coincidence; you build evals before you optimize specifically so cost/latency improvements don’t quietly tank quality.