Evaluation & Metrics
A model is only as good as your ability to measure it. Picking the wrong metric is one of the most common — and most expensive — ML mistakes.
The confusion matrix
For binary classification:
| | Predicted + | Predicted − |
|---|---|---|
| Actual + | True Positive (TP) | False Negative (FN) |
| Actual − | False Positive (FP) | True Negative (TN) |
Everything below comes from these four counts.
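A minimal numpy sketch of those four counts (function name is illustrative):

```python
import numpy as np

def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary 0/1 labels and predictions."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn
```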
Accuracy
Accuracy = (TP + TN) / (TP + TN + FP + FN)
The fraction of correct predictions. Easy to compute, often misleading.
Pitfall: with class imbalance, accuracy misleads. If 99% of credit-card transactions are legit, always predicting “legit” scores 99% accuracy while catching zero fraud.
Precision, recall, F1
- Precision = TP / (TP + FP). “Of what I flagged, how much was actually positive?”
- Recall = TP / (TP + FN). “Of what was actually positive, how much did I catch?”
- F1 = harmonic mean of precision and recall = 2PR / (P + R).
Use precision when false positives are expensive (e.g. spam → don’t block real email). Use recall when false negatives are expensive (e.g. cancer screening). Use F1 when you want a single number balancing both.
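Building on the counts above, a small helper that guards the zero-denominator edge cases (names illustrative):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, F1 from confusion-matrix counts.
    Returns 0.0 where a denominator is zero (nothing flagged / no positives)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```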
ROC and AUC
The ROC curve plots TPR (recall) vs FPR (= FP / (FP + TN)) at every classification threshold.
AUC-ROC is the area under that curve. Properties:
- Range: 0 to 1. 0.5 is random ranking, 1.0 is perfect; below 0.5 means the model ranks negatives above positives.
- Threshold-independent — measures how well the model ranks positives above negatives.
- Insensitive to class prevalence (but see the PR-AUC caveat below).
AUC ≈ “probability that a random positive scores higher than a random negative.” The most useful single number for binary classifiers.
For severely imbalanced problems, PR-AUC (precision-recall AUC) is more informative than ROC-AUC.
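The pairwise interpretation above is also the easiest way to compute AUC directly. A minimal sketch (O(P·N) pair counting, fine for small test sets; the function name is illustrative):

```python
import numpy as np

def auc_by_ranking(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly;
    ties count as half. Matches the probabilistic interpretation."""
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Broadcast every positive score against every negative score.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```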
Multi-class metrics
- Macro-averaged: average per-class metric. Treats all classes equally.
- Micro-averaged: aggregate counts across classes, then compute. Treats all examples equally.
- Weighted: weighted by class support.
For imbalanced multi-class problems, macro-F1 is the standard “fairness across classes” metric. Use micro/weighted when you care about overall throughput.
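A sketch of macro averaging from scratch; for single-label problems, micro-F1 reduces to plain accuracy, so it needs no separate code (names illustrative):

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    """One-vs-rest F1 for each class in `classes`."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    f1s = []
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return np.array(f1s)

# Macro-F1: unweighted mean, so every class counts equally regardless of support.
# f1s = per_class_f1(y_true, y_pred, np.unique(y_true)); macro_f1 = f1s.mean()
```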
Regression metrics
| Metric | Formula | Notes |
|---|---|---|
| MSE | mean((y − ŷ)²) | Same as the loss; in squared units |
| RMSE | √MSE | In original units |
| MAE | mean(\|y − ŷ\|) | In original units; robust to outliers |
| R² | 1 − SS_res/SS_tot | Fraction of variance explained; can be negative |
| MAPE | mean(\|y − ŷ\| / \|y\|) × 100% | Scale-free; undefined when y = 0 |
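All five in a few lines of numpy (a sketch; note the MAPE caveat at y = 0):

```python
import numpy as np

def regression_metrics(y, y_hat):
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    mse = np.mean((y - y_hat) ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(y - y_hat))
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    mape = np.mean(np.abs((y - y_hat) / y)) * 100  # blows up if any y == 0
    return dict(mse=mse, rmse=rmse, mae=mae, r2=r2, mape=mape)
```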
Calibration
A 90%-confident classifier should be right 90% of the time. Many neural nets aren’t — they’re overconfident.
Measure with expected calibration error (ECE) or a reliability diagram (predicted prob vs. observed accuracy, binned).
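A minimal binary-classification ECE sketch, assuming `probs` holds predicted P(y = 1) and equal-width bins (names illustrative):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin P(y=1) into equal-width bins, compare mean predicted
    probability to the observed positive rate, weight by bin population."""
    probs, labels = np.asarray(probs, float), np.asarray(labels, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i in range(n_bins):
        lo, hi = edges[i], edges[i + 1]
        # First bin includes its lower edge so prob == 0.0 isn't dropped.
        mask = (probs >= lo) & (probs <= hi) if i == 0 else (probs > lo) & (probs <= hi)
        if mask.any():
            ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece
```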
Fixes:
- Temperature scaling on a held-out set (sketched below)
- Label smoothing during training
- Mixup augmentation
- Classic alternatives: Platt scaling, isotonic regression
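Temperature scaling itself is a one-parameter fit: find the T > 0 that minimizes held-out NLL of softmax(logits / T). A sketch assuming scipy and a logits matrix (names illustrative):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Find T > 0 minimizing NLL of softmax(logits / T) on held-out data.
    logits: (n, k) array; labels: (n,) int class indices."""
    logits, labels = np.asarray(logits, float), np.asarray(labels, int)

    def nll(t):
        z = logits / t
        z -= z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x
```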
Ranking & retrieval metrics
For search, recommenders, retrieval:
- Recall@k: of the relevant items, how many in top k?
- Precision@k: of the top k, how many are relevant?
- MRR (Mean Reciprocal Rank): 1 / position of first relevant result.
- nDCG (Normalized Discounted Cumulative Gain): relevance-weighted, position-discounted.
- MAP (Mean Average Precision): average precision averaged over queries.
In RAG (Stage 09), recall@k is the headline retrieval metric. nDCG dominates classical IR research.
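Minimal sketches of three of these, assuming `ranked_ids` is the system's ordering and `relevant_ids` the ground-truth set; the nDCG here uses linear gain, while some variants use 2^rel − 1:

```python
import numpy as np

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant items appearing in the top k (assumes >= 1 relevant)."""
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def mrr(ranked_ids, relevant_ids):
    """Reciprocal rank of the first relevant result (0 if none found)."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(relevances, k):
    """nDCG@k from graded relevances listed in ranked order."""
    rel = np.asarray(relevances, float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    dcg = (rel * discounts).sum()
    ideal = np.sort(np.asarray(relevances, float))[::-1][:k]
    idcg = (ideal * discounts).sum()
    return dcg / idcg if idcg > 0 else 0.0
```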
LLM-specific metrics
LLMs broke many assumptions of classical ML metrics. New ones emerged:
- Perplexity: lower = better next-token prediction. Used during pretraining.
- Exact match / F1 (for QA): does the answer string match?
- BLEU, ROUGE, METEOR (for translation/summarization): n-gram overlap. Imperfect but cheap.
- BERTScore: semantic similarity using BERT embeddings.
- Pass@k (for code): does at least one of k generated programs pass the tests? (Estimator sketched below.)
- LLM-as-judge: another model rates outputs. Cheap, fast, biased — see Stage 13 for caveats.
- Human preference: gold standard but expensive.
Modern LLM evals usually combine these — task-specific metrics + LLM-as-judge + human spot checks.
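Pass@k is usually computed with the unbiased estimator from the Codex paper: generate n samples, count c correct, and estimate the chance that a random size-k subset contains at least one pass. A sketch:

```python
import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k), i.e. one minus the chance
    that all k sampled generations are incorrect. Stable product form."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-subset: some sample passes
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```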
Statistical significance
When comparing two models on a benchmark, the gap might be noise. To check:
- Bootstrap: resample the test set, recompute the metric, repeat. The 95% CI tells you the noise floor.
- McNemar’s test: for paired binary classifications.
- Sign test / paired t-test: for paired metric scores.
If the bootstrapped CI on the score difference includes zero, the new model isn't better; it's the same model with more hype.
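A percentile-bootstrap sketch over per-example scores (e.g. 0/1 correctness); for comparing two models, feed it the paired per-example score differences:

```python
import numpy as np

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-example scores."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, float)
    means = [rng.choice(scores, size=len(scores), replace=True).mean()
             for _ in range(n_resamples)]
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])
```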
Designing an eval
The hardest part of ML, especially for LLM systems.
A good eval has:
- Multiple slices: overall, per-class, per-cohort, per-difficulty. A single average hides everything.
- A frozen test set: don’t change it after seeing results.
- A label provenance you trust: spot-check labels yourself.
- A baseline: a trivial model (always predict majority, BM25, GPT-3.5) to compare against.
- A rubric: what counts as “good enough”? Set the bar before you measure.
For LLM products specifically (Stage 13), evals also need:
- Production trace replay
- Adversarial inputs
- Drift monitoring
Pitfalls
- Optimizing the metric instead of the goal (Goodhart's law). Click-through does not mean engagement; engagement does not mean satisfaction.
- Overfitting to the eval. If you tune on the test set, even unintentionally, scores inflate.
- Single-number reporting. Always report distribution / multiple slices / confidence intervals.
- Comparing across data shifts. Yesterday’s 92% and today’s 91% might be the same model on harder data.
Exercises
- Build a confusion matrix for a binary classifier with predictions `[0,1,1,0,1,1,0,0]` and labels `[0,1,0,0,1,1,1,0]`. Compute accuracy, precision, recall, and F1 by hand.
- AUC by ranking. Given scores `[0.9, 0.4, 0.7, 0.2, 0.5]` and labels `[1, 0, 1, 0, 1]`, compute AUC by counting correctly-ordered pairs.
- Calibration plot. Train a model on a small dataset, plot predicted probability vs. actual frequency in 10 bins.
- Bootstrap a CI. For a model with 92% accuracy on 500 examples, compute a 95% CI on the accuracy via 1000 bootstrap resamples.