Evaluation & Metrics
A model is only as good as your ability to measure it. Picking the wrong metric is one of the most common — and most expensive — ML mistakes.
The confusion matrix
For binary classification:
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | True Positive (TP) | False Negative (FN) |
| Actual − | False Positive (FP) | True Negative (TN) |
Everything below comes from these four counts.
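A minimal numpy sketch of those four counts (function name is illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary 0/1 labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn
```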
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The fraction of correct predictions. Easy to compute, often misleading.
Pitfall: with class imbalance, accuracy misleads. If 99% of credit-card transactions are legit, always predicting “legit” scores 99% accuracy while catching zero fraud.
Precision, recall, F1
- Precision = TP / (TP + FP). “Of what I flagged, how much was actually positive?”
- Recall = TP / (TP + FN). “Of what was actually positive, how much did I catch?”
- F1 = harmonic mean of precision and recall = 2PR / (P + R).
Use precision when false positives are expensive (e.g. spam → don’t block real email). Use recall when false negatives are expensive (e.g. cancer screening). Use F1 when you want a single number balancing both.
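Building on the counts above, a small helper that guards the zero-denominator edge cases (names illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, F1 from confusion-matrix counts.
    Returns 0.0 where a denominator is zero (nothing flagged / no positives)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```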
ROC and AUC
The ROC curve plots TPR (recall) vs FPR (= FP / (FP + TN)) at every classification threshold.
AUC-ROC is the area under that curve. Properties:
- Range: 0 to 1. 0.5 is random ranking, 1.0 is perfect; below 0.5 means the model ranks negatives above positives.
- Threshold-independent — measures how well the model ranks positives above negatives.
- Insensitive to class prevalence (but see the PR-AUC caveat below).
AUC ≈ “probability that a random positive scores higher than a random negative.” The most useful single number for binary classifiers.
For severely imbalanced problems, PR-AUC (precision-recall AUC) is more informative than ROC-AUC.
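The pairwise interpretation above is also the easiest way to compute AUC directly. A minimal sketch (O(P·N) pair counting, fine for small test sets; the function name is illustrative):

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half. Matches the probabilistic interpretation."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Broadcast every positive score against every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```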
Multi-class metrics
- Macro-averaged: average per-class metric. Treats all classes equally.
- Micro-averaged: aggregate counts across classes, then compute. Treats all examples equally.
- Weighted: weighted by class support.
For imbalanced multi-class problems, macro-F1 is the standard “fairness across classes” metric. Use micro/weighted when you care about overall throughput.
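A sketch of macro averaging from scratch; for single-label problems, micro-F1 reduces to plain accuracy, so it needs no separate code (names illustrative):

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    """One-vs-rest F1 for each class in `classes`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return np.array(f1s)

# Macro-F1: unweighted mean, so every class counts equally regardless of support.
# f1s = per_class_f1(y_true, y_pred, np.unique(y_true)); macro_f1 = f1s.mean()
```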
Regression metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | mean((y − ŷ)²) | Same as the loss; in squared units |
| RMSE | √MSE | In original units |
| MAE | mean(\|y − ŷ\|) | In original units; robust to outliers |
| R² | 1 − SS_res/SS_tot | Fraction of variance explained; can be negative |
| MAPE | mean(\|y − ŷ\| / \|y\|) × 100% | Scale-free; undefined when y = 0 |
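All five in a few lines of numpy (a sketch; note the MAPE caveat at y = 0):

```python
import numpy as np

def regression_metrics(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y - y_hat))
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    mape = np.mean(np.abs((y - y_hat) / y)) * 100  # blows up if any y == 0
    return dict(mse=mse, rmse=rmse, mae=mae, r2=r2, mape=mape)
```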
Calibration
A 90%-confident classifier should be right 90% of the time. Many neural nets aren’t — they’re overconfident.
Measure with expected calibration error (ECE) or a reliability diagram (predicted prob vs. observed accuracy, binned).
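A minimal binary-classification ECE sketch, assuming `probs` holds predicted P(y = 1) and equal-width bins (names illustrative):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin P(y=1) into equal-width bins, compare mean predicted
    probability to the observed positive rate, weight by bin population."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin includes its lower edge so prob == 0.0 isn't dropped.
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```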
Fixes:
- Temperature scaling on a held-out set (sketched below)
- Label smoothing during training
- Mixup augmentation
- Classic alternatives: Platt scaling, isotonic regression
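Temperature scaling itself is a one-parameter fit: find the T > 0 that minimizes held-out NLL of softmax(logits / T). A sketch assuming scipy and a logits matrix (names illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Find T > 0 minimizing NLL of softmax(logits / T) on held-out data.
    logits: (n, k) array; labels: (n,) int class indices."""
    logits, labels = np.asarray(logits, float), np.asarray(labels, int)

    def nll(t):
        z = logits / t
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```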
Ranking & retrieval metrics
For search, recommenders, retrieval:
- Recall@k: of the relevant items, how many in top k?
- Precision@k: of the top k, how many are relevant?
- MRR (Mean Reciprocal Rank): 1 / position of first relevant result.
- nDCG (Normalized Discounted Cumulative Gain): relevance-weighted, position-discounted.
- MAP (Mean Average Precision): average precision averaged over queries.
In RAG (Stage 09), recall@k is the headline retrieval metric. nDCG dominates classical IR research.
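Minimal sketches of three of these, assuming `ranked_ids` is the system's ordering and `relevant_ids` the ground-truth set; the nDCG here uses linear gain, while some variants use 2^rel − 1:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items appearing in the top k (assumes >= 1 relevant)."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevances, k):
    """nDCG@k from graded relevances listed in ranked order."""
    rel = np.asarray(relevances, float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```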
LLM-specific metrics
LLMs broke many assumptions of classical ML metrics. New ones emerged:
- Perplexity: lower = better next-token prediction. Used during pretraining.
- Exact match / F1 (for QA): does the answer string match?
- BLEU, ROUGE, METEOR (for translation/summarization): n-gram overlap. Imperfect but cheap.
- BERTScore: semantic similarity using BERT embeddings.
- Pass@k (for code): does at least one of k generated programs pass the tests? (Estimator sketched below.)
- LLM-as-judge: another model rates outputs. Cheap, fast, biased — see Stage 13 for caveats.
- Human preference: gold standard but expensive.
Modern LLM evals usually combine these — task-specific metrics + LLM-as-judge + human spot checks.
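Pass@k is usually computed with the unbiased estimator from the Codex paper: generate n samples, count c correct, and estimate the chance that a random size-k subset contains at least one pass. A sketch:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), i.e. one minus the chance
    that all k sampled generations are incorrect. Stable product form."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: some sample passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```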
Statistical significance
When comparing two models on a benchmark, the gap might be noise. To check:
- Bootstrap: resample the test set, recompute the metric, repeat. The 95% CI tells you the noise floor.
- McNemar’s test: for paired binary classifications.
- Sign test / paired t-test: for paired metric scores.
If the bootstrapped CI on the score difference includes zero, the new model isn't better; it's the same model with more hype.
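A percentile-bootstrap sketch over per-example scores (e.g. 0/1 correctness); for comparing two models, feed it the paired per-example score differences:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```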
Designing an eval
The hardest part of ML, especially for LLM systems.
A good eval has:
- Multiple slices: overall, per-class, per-cohort, per-difficulty. A single average hides everything.
- A frozen test set: don’t change it after seeing results.
- A label provenance you trust: spot-check labels yourself.
- A baseline: a trivial model (always predict majority, BM25, GPT-3.5) to compare against.
- A rubric: what counts as “good enough”? Set the bar before you measure.
For LLM products specifically (Stage 13), evals also need:
- Production trace replay
- Adversarial inputs
- Drift monitoring
Pitfalls
- Optimizing the metric instead of the goal (Goodhart's law). Click-through does not mean engagement; engagement does not mean satisfaction.
- Overfitting to the eval. If you tune on the test set, even unintentionally, scores inflate.
- Single-number reporting. Always report distribution / multiple slices / confidence intervals.
- Comparing across data shifts. Yesterday’s 92% and today’s 91% might be the same model on harder data.
Exercises
- Build a confusion matrix for a binary classifier with predictions `[0,1,1,0,1,1,0,0]` and labels `[0,1,0,0,1,1,1,0]`. Compute accuracy, precision, recall, and F1 by hand.
- AUC by ranking. Given scores `[0.9, 0.4, 0.7, 0.2, 0.5]` and labels `[1, 0, 1, 0, 1]`, compute AUC by counting correctly-ordered pairs.
- Calibration plot. Train a model on a small dataset, plot predicted probability vs. actual frequency in 10 bins.
- Bootstrap a CI. For a model with 92% accuracy on 500 examples, compute a 95% CI on the accuracy via 1000 bootstrap resamples.