step 13 · ship · production
Evaluation in production
A/B testing prompts, drift detection, golden-set regression. How to know your model got worse before users do.
The eval harness from step 04 ran against a fixed test set, on demand. That’s the offline mode — useful before you ship, useless after. Once your service is live, three new things happen:
- Real users ask questions you didn’t anticipate. Your test set has 50 questions; users have asked 50,000 questions in the first week, and the long tail looks nothing like your test set.
- Inputs drift. The questions users ask in week 8 are different from week 1. Your eval scores at deploy time say nothing about how the model is doing on this week’s traffic.
- Things change underneath you. A vLLM upgrade, a model swap, a prompt template tweak in someone else’s PR — any of these can quietly degrade quality. Without prod evals, you find out from a Twitter complaint.
Production evals close the loop. The eval harness from step 04 is half the answer; this step is the other half.
Four things you actually need
Most “production AI eval” articles list 30 things and you do none of them. Here are the four that matter, ranked by ROI:
- Golden-set regression on every deploy. A small (50–500), curated set of (input, expected, grader) cases that runs as part of CI. If quality drops on this set, the deploy fails.
- A/B prompt comparison with statistical rigor. When you change a prompt, you want a quick answer to “is the new one actually better?” Same task set, two prompts, paired comparison, p-value.
- Drift detection on live traffic. Score a sample of live requests with an LLM-as-judge against a quality rubric. Alert if the rolling mean drops.
- Thumbs-down to eval pipeline. When a user gives a bad rating, capture the trace and add it (with a reviewed expected output) to your golden set. Your test set grows automatically with the failure modes you actually see.
We’ll build all four. None individually is more than ~50 lines.
Setup
The eval pipeline reuses the harness from step 04 — TaskCase, grade_rules, grade_judge. We’ll add one new dependency for statistics:
uv add scipy
Open the new file:
# stack/eval_prod.py
from __future__ import annotations
import json
import statistics
import time
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable
import scipy.stats
from stack.eval import TaskCase, grade_rules, grade_judge
from stack.llm import LLM
Golden-set regression
@dataclass
class RegressionResult:
passed: int
failed: int
cases: list[dict] = field(default_factory=list)
pass_rate: float = 0.0
def is_passing(self, threshold: float = 0.9) -> bool:
return self.pass_rate >= threshold
def run_regression(
cases: list[TaskCase],
runner: Callable[[str], str],
grader: Callable[[TaskCase, str], tuple[bool, str]] = grade_rules,
) -> RegressionResult:
"""Run a fixed set of cases and report pass/fail."""
passed = 0
failed = 0
out_cases = []
for case in cases:
actual = runner(case.input)
ok, reason = grader(case, actual)
out_cases.append({
"id": case.id, "ok": ok,
"input": case.input[:200],
"actual": actual[:300],
"reason": reason,
})
if ok:
passed += 1
else:
failed += 1
total = passed + failed
return RegressionResult(
passed=passed, failed=failed,
cases=out_cases,
pass_rate=passed / total if total else 0.0,
)
Wire it to your live service in CI:
# scripts/regression_test.py
import os
import sys
import httpx
from stack.eval_prod import run_regression
from stack.eval import load_cases
def runner_via_api(text: str) -> str:
"""Hit the live service exactly as a real user would."""
r = httpx.post(
"http://localhost:8000/v1/chat/completions",
headers={"Authorization": "Bearer ${API_KEY}"},
json={"messages": [{"role": "user", "content": text}]},
timeout=60,
)
r.raise_for_status()
return r.json()["choices"][0]["message"]["content"]
if __name__ == "__main__":
cases = load_cases("evals/golden.jsonl")
result = run_regression(cases, runner_via_api)
print(f"pass rate: {result.pass_rate:.1%} ({result.passed}/{result.passed + result.failed})")
for c in result.cases:
if not c["ok"]:
print(f" FAIL [{c['id']}]: {c['reason']}")
sys.exit(0 if result.is_passing(0.9) else 1)
In your CI config:
# .github/workflows/deploy.yml
- name: Boot service
run: docker compose up -d
- name: Regression
run: uv run python scripts/regression_test.py
The script’s exit code gates the deploy: a pass rate below 90% fails the CI job, so you catch the obvious regressions before users do.
A/B prompt testing
The hardest thing about A/B prompt comparison isn’t the code; it’s having the discipline to do it before shipping every prompt change. Most teams eyeball a few examples, declare the new prompt better, and ship. Then quality drops 5% on production traffic. Build the harness, run it, look at the p-value.
# stack/eval_prod.py (continued)
@dataclass
class ABResult:
prompt_a_mean: float
prompt_b_mean: float
delta: float # B - A
p_value: float
significant: bool # True if p < 0.05 and |delta| > min_effect
n_cases: int
def ab_test_prompts(
cases: list[TaskCase],
runner_a: Callable[[str], str],
runner_b: Callable[[str], str],
judge_llm: LLM,
min_effect: float = 0.05,
) -> ABResult:
"""Paired A/B test of two prompt variants on the same cases.
`runner_a` and `runner_b` are functions that wrap the two prompts
and return the model's answer. We grade with an LLM-as-judge for
open-ended tasks; swap to grade_rules for closed ones.
"""
scores_a, scores_b = [], []
for case in cases:
ans_a = runner_a(case.input)
ans_b = runner_b(case.input)
# Same judge, same rubric: each answer is scored 1–5.
score_a, _ = grade_judge(judge_llm, case, ans_a)
score_b, _ = grade_judge(judge_llm, case, ans_b)
scores_a.append(score_a)
scores_b.append(score_b)
mean_a = statistics.mean(scores_a)
mean_b = statistics.mean(scores_b)
delta = mean_b - mean_a
# Paired t-test: same cases, two treatments.
_, p_value = scipy.stats.ttest_rel(scores_b, scores_a)
return ABResult(
prompt_a_mean=mean_a, prompt_b_mean=mean_b,
delta=delta, p_value=p_value,
significant=(p_value < 0.05 and abs(delta) >= min_effect),
n_cases=len(cases),
)
Run it:
ab = ab_test_prompts(
cases=load_cases("evals/golden.jsonl"),
runner_a=lambda q: chat_with_prompt(PROMPT_V1, q),
runner_b=lambda q: chat_with_prompt(PROMPT_V2, q),
judge_llm=LLM(),
)
print(f"Prompt A mean: {ab.prompt_a_mean:.2f}")
print(f"Prompt B mean: {ab.prompt_b_mean:.2f}")
print(f"Delta: {ab.delta:+.2f} (p={ab.p_value:.3f})")
print("VERDICT:", "B wins" if ab.significant and ab.delta > 0 else
"A wins" if ab.significant and ab.delta < 0 else
"no significant difference")
Two principles to internalize:
- Paired test, not unpaired. The same case sees both prompts. Variance from “this case is hard for any prompt” cancels out. You’ll see significance with 1/3 the sample size; the sketch after this list shows the effect on synthetic scores.
- Effect size and p-value. A 0.01 improvement at p=0.001 is statistically real but practically nothing. Set a min_effect floor (e.g. 0.05 = 5% of the score range) and reject changes below it even if they’re “significant.”
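To see why pairing matters, here is a small, self-contained illustration on synthetic scores. The numbers are made up; the point is that between-case difficulty swamps the unpaired comparison while the paired test still resolves a small, consistent lift.
# scripts/paired_vs_unpaired_demo.py (synthetic illustration, not part of the harness)
import random

import scipy.stats

n = 40
# Each case has its own difficulty; prompt B adds a small, consistent lift on top.
difficulty = [random.gauss(3.5, 0.8) for _ in range(n)]
scores_a = [d + random.gauss(0, 0.3) for d in difficulty]
scores_b = [d + 0.15 + random.gauss(0, 0.3) for d in difficulty]

_, p_paired = scipy.stats.ttest_rel(scores_b, scores_a)    # same cases, two prompts
_, p_unpaired = scipy.stats.ttest_ind(scores_b, scores_a)  # pretends the groups are independent

print(f"paired p={p_paired:.4f}   unpaired p={p_unpaired:.4f}")
# On most runs the paired test is significant and the unpaired one is not,
# because between-case variance dominates the unpaired comparison.
This is exactly why ab_test_prompts runs both prompts over the same cases instead of splitting traffic between them.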
Drift detection
Drift detection is the live-traffic version of regression. Instead of a fixed test set, you sample N requests per day, judge them, and watch the rolling mean.
# stack/eval_prod.py (continued)
DRIFT_RUBRIC = """\
You are a quality judge. Score this assistant response on a 1–5 scale:
5 = Excellent. Directly answers the user, accurate, well-formatted.
4 = Good. Answers the user with minor flaws.
3 = Acceptable. Mostly addresses the question, some issues.
2 = Poor. Misses key parts of the question or has notable errors.
1 = Bad. Wrong, off-topic, or unhelpful.
Output ONLY a JSON object: {"score": <int>, "reason": "<short reason>"}
"""
@dataclass
class TraceSample:
"""One captured trace with its query and response."""
trace_id: str
timestamp: float
user_query: str
assistant_response: str
def judge_sample(judge_llm: LLM, sample: TraceSample) -> tuple[int, str]:
"""LLM-as-judge over a single live trace sample. Returns (score, reason)."""
response = judge_llm.chat(
messages=[
{"role": "system", "content": DRIFT_RUBRIC},
{"role": "user", "content":
f"USER ASKED: {sample.user_query}\n\n"
f"ASSISTANT REPLIED: {sample.assistant_response}"},
],
temperature=0.0,
)
text = response["choices"][0]["message"]["content"] or "{}"
try:
obj = json.loads(text)
return int(obj["score"]), obj.get("reason", "")
except Exception:
return 0, "judge parse error"
@dataclass
class DriftReport:
window_start: float
window_end: float
n_samples: int
mean_score: float
p10_score: float # 10th percentile — surfaces tail regressions
bad_share: float # fraction with score <= 2
delta_vs_baseline: float | None = None
alert: bool = False
def compute_drift(
samples: list[TraceSample],
judge_llm: LLM,
baseline_mean: float | None = None,
alert_threshold: float = 0.2,
) -> DriftReport:
"""Score N live samples and produce a report.
Alert if the mean drops by more than `alert_threshold` vs baseline.
"""
scores = [judge_sample(judge_llm, s)[0] for s in samples]
scores = [s for s in scores if s > 0] # drop parse errors
mean_score = statistics.mean(scores) if scores else 0.0
p10 = statistics.quantiles(scores, n=10)[0] if len(scores) >= 10 else (min(scores) if scores else 0.0)
bad_share = sum(1 for s in scores if s <= 2) / len(scores) if scores else 0.0
delta = None
alert = False
if baseline_mean is not None:
delta = mean_score - baseline_mean
alert = delta < -alert_threshold
return DriftReport(
window_start=min(s.timestamp for s in samples),
window_end=max(s.timestamp for s in samples),
n_samples=len(scores), mean_score=mean_score,
p10_score=p10, bad_share=bad_share,
delta_vs_baseline=delta, alert=alert,
)
load_traces_from_phoenix() queries the Phoenix API for traces in a time window; it is only a handful of lines of httpx.
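A rough sketch for orientation: the endpoint path, query parameters, and response fields below are placeholders, not Phoenix's documented API, so adapt them to whatever your trace store actually exposes.
# stack/eval_prod.py (continued; the extra imports live at the top of the file)
import datetime as dt
import random
import httpx

def load_traces_from_phoenix(
    since: dt.datetime,
    sample_size: int = 200,
    base_url: str = "http://localhost:6006",  # placeholder: your Phoenix instance
) -> list[TraceSample]:
    """Fetch recent traces and down-sample them for judging.
    The /v1/traces path and the field names below are placeholders; check
    what your Phoenix version (or other trace store) actually serves.
    """
    r = httpx.get(
        f"{base_url}/v1/traces",  # placeholder endpoint
        params={"start_time": since.isoformat(), "limit": 5000},
        timeout=30,
    )
    r.raise_for_status()
    rows = r.json()["traces"]  # placeholder response shape
    samples = [
        TraceSample(
            trace_id=row["trace_id"],
            timestamp=row["timestamp"],
            user_query=row["input"],
            assistant_response=row["output"],
        )
        for row in rows
    ]
    return random.sample(samples, min(sample_size, len(samples)))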
Schedule it:
# scripts/daily_drift.py
import datetime as dt
import json
from stack.eval_prod import compute_drift, load_traces_from_phoenix
from stack.llm import LLM
if __name__ == "__main__":
samples = load_traces_from_phoenix(
since=dt.datetime.utcnow() - dt.timedelta(days=1),
sample_size=200,
)
report = compute_drift(samples, LLM(), baseline_mean=4.1)
print(json.dumps({
"n": report.n_samples,
"mean": report.mean_score,
"p10": report.p10_score,
"bad_share": report.bad_share,
"delta": report.delta_vs_baseline,
"alert": report.alert,
}, indent=2))
if report.alert:
# Slack-webhook, PagerDuty, whatever you use.
notify_team(f"Quality drift: mean dropped to {report.mean_score:.2f}")
Cron at 2 a.m., 200 samples a day, baseline anchored to your golden-set score at deploy time. Quality drops 0.2 points and you get paged before users complain.
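The schedule can be a plain crontab entry; the paths below are illustrative.
# crontab -e   (illustrative paths)
0 2 * * * cd /srv/stack && uv run python scripts/daily_drift.py >> /var/log/daily_drift.log 2>&1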
Feedback to eval pipeline
The last piece. When users thumb-down a response, capture it as a candidate eval case:
# stack/server.py — modified to capture feedback
@app.post("/v1/feedback")
async def feedback(
payload: FeedbackPayload,
api_key: str = Depends(verify_key),
) -> dict:
"""Capture user feedback. Routes thumbs-down to the eval candidate queue."""
record = {
"trace_id": payload.trace_id,
"rating": payload.rating,
"comment": payload.comment,
"ts": int(time.time() * 1000),
}
# Append to local JSONL for now; in prod this is a queue / DB.
Path("evals/feedback.jsonl").open("a").write(json.dumps(record) + "\n")
return {"ok": True}
A short script promotes thumbs-down into golden-set candidates (load_trace is the single-trace counterpart of load_traces_from_phoenix above):
# scripts/feedback_to_eval.py
import json
from pathlib import Path
from stack.eval_prod import load_trace
from stack.eval import TaskCase
for line in Path("evals/feedback.jsonl").open():
record = json.loads(line)
if record["rating"] != "down":
continue
trace = load_trace(record["trace_id"])
case = TaskCase(
id=f"feedback-{record['trace_id'][:8]}",
input=trace.user_query,
expected="<TODO: human review needed>",
grader="rules", # or "judge"
)
Path("evals/candidates.jsonl").open("a").write(case.to_json() + "\n")
A human reviews candidates.jsonl weekly, fills in expected outputs, promotes the cases into golden.jsonl. Now your test suite grows with your real failure modes — not the failure modes you imagined at deploy time.
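The promotion step itself can stay simple. A sketch, assuming TaskCase.to_json() writes one JSON object per line with the expected field shown above:
# scripts/promote_candidates.py (a minimal sketch of the weekly promotion step)
import json
from pathlib import Path

candidates = Path("evals/candidates.jsonl")
golden = Path("evals/golden.jsonl")

promoted, pending = 0, 0
with golden.open("a") as out:
    for line in candidates.open():
        case = json.loads(line)
        if case["expected"].startswith("<TODO"):
            pending += 1  # still waiting on human review
            continue
        out.write(json.dumps(case) + "\n")
        promoted += 1

print(f"promoted {promoted}, still pending review: {pending}")
In practice you’d also rewrite candidates.jsonl without the promoted rows (or mark them) so nothing gets promoted twice.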
Cross-references
- LLM Evaluation demo — the offline harness from step 04, in interactive form
- LLM as Judge demo — how a judge actually scores responses
- Eval Harness article — the theory side
- Step 12 — the trace data this article consumes
What we did and didn’t do
What we did:
- A regression runner that gates deploys on a 90% pass rate
- A paired A/B harness with a t-test and an effect-size threshold
- A drift detector that scores live samples and alerts on rolling-mean drops
- A feedback collector that promotes thumbs-down into eval candidates
What we didn’t:
- Continuous online learning. Closing the loop by fine-tuning on bad cases. Powerful and risky; defer until you have rock-solid evals (and step 16 covers when this is worth it).
- Multi-judge ensembling. Average over GPT-4o, Claude, and a local model to reduce single-judge bias. ~3× cost; worth it only when single-judge scores stop tracking human ratings.
- Stratified sampling. Sample drift detection by user segment, language, or topic. Important when your traffic is heterogeneous; one extra GROUP BY in the trace query.
- Synthetic data generation for eval. Have a model generate test cases. Useful for getting started; less useful once you have real user data.
Next
Step 14 is cost and latency tuning — caches, batching, KV reuse, quantization, and the four levers behind every serving optimization. The eval pipeline you just built tells you whether quality stays put while you make the system faster and cheaper. That’s not a coincidence; you build evals before you optimize specifically so cost/latency improvements don’t quietly tank quality.