Evaluation & Metrics

A model is only as good as your ability to measure it. Picking the wrong metric is one of the most common — and most expensive — ML mistakes.

The confusion matrix

For binary classification:

              Predicted +       Predicted −
Actual +      True Positive     False Negative
Actual −      False Positive    True Negative

Everything below comes from these four counts.
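
A minimal sketch of those four counts from paired labels and predictions (the function and variable names are illustrative, not a library API):

```python
# Minimal sketch: the four confusion-matrix counts for binary labels (0/1).
def confusion_counts(y_true, y_pred):
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn
```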

Accuracy

Accuracy = (TP + TN) / (TP + TN + FP + FN)

The fraction of correct predictions. Easy to compute, often misleading.

Pitfall: with class imbalance, accuracy is useless. 99% of credit-card transactions are legit — predict “legit” always and you’re 99% accurate, while missing every fraud.

Precision, recall, F1

  • Precision = TP / (TP + FP). “Of what I flagged, how much was actually positive?”
  • Recall = TP / (TP + FN). “Of what was actually positive, how much did I catch?”
  • F1 = harmonic mean of precision and recall = 2PR / (P + R).

Use precision when false positives are expensive (e.g. spam → don’t block real email). Use recall when false negatives are expensive (e.g. cancer screening). Use F1 when you want a single number balancing both.
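
A sketch building directly on the counts above, with the conventional zero guards for empty denominators:

```python
# Sketch: precision, recall, F1 from confusion-matrix counts.
def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```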

ROC and AUC

The ROC curve plots TPR (recall) vs FPR (= FP / (FP + TN)) at every classification threshold.

AUC-ROC is the area under that curve. Properties:

  • Range: 0 to 1.0; 0.5 is random guessing, 1.0 is perfect (below 0.5 means the ranking is inverted).
  • Threshold-independent — measures how well the model ranks positives above negatives.
  • Insensitive to class prevalence, since TPR and FPR are each computed within one class (but see the PR-AUC caveat below).

AUC ≈ “probability that a random positive scores higher than a random negative.” The most useful single number for binary classifiers.

For severely imbalanced problems, PR-AUC (precision-recall AUC) is more informative than ROC-AUC.
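
A sketch of AUC computed straight from the pair-ranking definition (illustrative names; quadratic in the number of examples, so for intuition only):

```python
# Sketch: AUC as the fraction of correctly ordered (positive, negative)
# score pairs, with ties counting half. O(n_pos * n_neg); intuition only.
def auc_by_pairs(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```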

Multi-class metrics

  • Macro-averaged: average per-class metric. Treats all classes equally.
  • Micro-averaged: aggregate counts across classes, then compute. Treats all examples equally.
  • Weighted: per-class metrics averaged with weights proportional to class support (the number of true instances per class).

For imbalanced multi-class problems, macro-F1 is the standard “fairness across classes” metric. Use micro/weighted when you care about overall throughput.
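
A quick illustration of the three averaging modes, assuming scikit-learn is available (the toy labels are made up):

```python
# Same predictions, three averaging modes, three different numbers.
from sklearn.metrics import f1_score

y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 0, 1, 0, 2, 2]

print(f1_score(y_true, y_pred, average="macro"))     # classes count equally
print(f1_score(y_true, y_pred, average="micro"))     # examples count equally
print(f1_score(y_true, y_pred, average="weighted"))  # classes weighted by support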

Regression metrics

Metric   Formula                  Notes
MSE      mean((y − ŷ)²)           Same as the loss; in squared units
RMSE     √MSE                     In original units
MAE      mean(|y − ŷ|)            In original units; robust to outliers
R²       1 − SS_res / SS_tot      Fraction of variance explained; can be negative
MAPE     mean(|y − ŷ| / |y|)      Percentage error; undefined when any y = 0
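
A sketch of the table's metrics with NumPy (the function name is illustrative):

```python
# Sketch: the standard regression metrics from the table above.
import numpy as np

def regression_metrics(y, y_hat):
    y = np.asarray(y, float)
    y_hat = np.asarray(y_hat, float)
    err = y - y_hat
    mse = np.mean(err ** 2)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)
    return {
        "mse": mse,
        "rmse": np.sqrt(mse),
        "mae": np.mean(np.abs(err)),
        "r2": r2,
        "mape": np.mean(np.abs(err) / np.abs(y)),  # blows up near y = 0
    }
```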

Calibration

A 90%-confident classifier should be right 90% of the time. Many neural nets aren’t — they’re overconfident.

Measure with expected calibration error (ECE) or a reliability diagram (predicted prob vs. observed accuracy, binned).
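
A minimal ECE sketch for binary classifiers, binning the predicted positive-class probability into equal-width bins (the multi-class version bins the max-class confidence instead; names are illustrative):

```python
# Sketch: expected calibration error with equal-width probability bins.
import numpy as np

def ece(probs, labels, n_bins=10):
    probs = np.asarray(probs, float)
    labels = np.asarray(labels, float)
    bins = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            conf = probs[mask].mean()    # mean predicted probability in bin
            acc = labels[mask].mean()    # observed positive rate in bin
            total += mask.mean() * abs(acc - conf)
    return total
```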

Fixes:

  • Temperature scaling on a held-out set (sketched after this list)
  • Label smoothing during training
  • Mixup augmentation
  • Classic post-hoc methods: Platt scaling, isotonic regression
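
A minimal sketch of temperature scaling, fitting T by grid search over held-out NLL (production code typically optimizes T with LBFGS; names here are illustrative):

```python
# Sketch: temperature scaling by grid search on held-out data.
# `logits` is an (n, k) float array; `labels` is an (n,) int array.
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    def nll(T):
        p = softmax(logits / T)[np.arange(len(labels)), labels]
        return -np.mean(np.log(p))
    return min(grid, key=nll)              # T > 1 softens overconfident nets
```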

Ranking & retrieval metrics

For search, recommenders, retrieval:

  • Recall@k: of the relevant items, how many in top k?
  • Precision@k: of the top k, how many are relevant?
  • MRR (Mean Reciprocal Rank): average over queries of 1 / rank of the first relevant result.
  • nDCG (Normalized Discounted Cumulative Gain): relevance-weighted, position-discounted.
  • MAP (Mean Average Precision): average precision averaged over queries.

In RAG (Stage 09), recall@k is the headline retrieval metric. nDCG dominates classical IR research.
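
Per-query sketches of these (MRR and MAP average such per-query numbers over many queries; the names and the linear-gain nDCG variant are choices here, not a fixed API):

```python
# `ranked` is a list of relevance grades in model order (0 = irrelevant).
import math

def recall_at_k(ranked, k, total_relevant):
    return sum(1 for r in ranked[:k] if r > 0) / total_relevant

def reciprocal_rank(ranked):
    return next((1.0 / (i + 1) for i, r in enumerate(ranked) if r > 0), 0.0)

def ndcg_at_k(ranked, k):
    # linear gain; some IR work uses (2**r - 1) instead
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(ranked[:k]))
    ideal = sum(r / math.log2(i + 2)
                for i, r in enumerate(sorted(ranked, reverse=True)[:k]))
    return dcg / ideal if ideal else 0.0
```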

LLM-specific metrics

LLMs broke many assumptions of classical ML metrics. New ones emerged:

  • Perplexity: exp of the average per-token negative log-likelihood; lower = better next-token prediction. Used during pretraining.
  • Exact match / F1 (for QA): does the answer string match exactly, or overlap at the token level (F1)?
  • BLEU, ROUGE, METEOR (for translation/summarization): n-gram overlap. Imperfect but cheap.
  • BERTScore: semantic similarity using BERT embeddings.
  • Pass@k (for code): does at least one of k generated programs pass the tests? (Estimator sketched below.)
  • LLM-as-judge: another model rates outputs. Cheap, fast, biased — see Stage 13 for caveats.
  • Human preference: gold standard but expensive.

Modern LLM evals usually combine these — task-specific metrics + LLM-as-judge + human spot checks.
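
For pass@k, the commonly used unbiased estimator draws n samples per problem, counts the c that pass, and computes the probability analytically; a minimal sketch:

```python
# Sketch: unbiased pass@k from n samples per problem, c of which pass.
# pass@k = 1 - C(n - c, k) / C(n, k)
from math import comb

def pass_at_k(n, c, k):
    if n - c < k:
        return 1.0                     # every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```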

Statistical significance

When comparing two models on a benchmark, the gap might be noise. To check:

  • Bootstrap: resample the test set, recompute the metric, repeat. The 95% CI tells you the noise floor. (Sketched below.)
  • McNemar’s test: for paired binary classifications.
  • Sign test / paired t-test: for paired metric scores.

If the bootstrapped CI on the paired difference includes zero, the new model isn’t better — it’s the same model with more hype. (Bootstrap the difference directly; merely overlapping per-model CIs is a weaker, less reliable check.)
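
A sketch of the percentile bootstrap on that paired difference (names are illustrative; for a single model's CI, apply the same resampling to one array of 0/1 outcomes):

```python
# Sketch: percentile bootstrap CI on a paired accuracy difference.
# `correct_a`, `correct_b` are 0/1 arrays, one entry per test example,
# aligned so index i refers to the same example for both models.
import numpy as np

def bootstrap_diff_ci(correct_a, correct_b, n_boot=1000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, float)
    b = np.asarray(correct_b, float)
    n = len(a)
    diffs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample the same rows for both
        diffs.append(a[idx].mean() - b[idx].mean())
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
```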

Designing an eval

The hardest part of ML, especially for LLM systems.

A good eval has:

  1. Multiple slices: overall, per-class, per-cohort, per-difficulty. A single average hides everything.
  2. A frozen test set: don’t change it after seeing results.
  3. A label provenance you trust: spot-check labels yourself.
  4. A baseline: a trivial model (always predict majority, BM25, GPT-3.5) to compare against.
  5. A rubric: what counts as “good enough”? Set the bar before you measure.

For LLM products specifically (Stage 13), evals also need:

  • Production trace replay
  • Adversarial inputs
  • Drift monitoring

Pitfalls

  • Optimizing the metric instead of the goal. Goodhart’s law: click-through does not mean engagement; engagement does not mean satisfaction.
  • Overfitting to the eval. If you tune on the test set, even unintentionally, scores inflate.
  • Single-number reporting. Always report distribution / multiple slices / confidence intervals.
  • Comparing across data shifts. Yesterday’s 92% and today’s 91% might be the same model on harder data.

Exercises

  1. Build a confusion matrix for a binary classifier with predictions [0,1,1,0,1,1,0,0] and labels [0,1,0,0,1,1,1,0]. Compute accuracy, precision, recall, F1, by hand.
  2. AUC by ranking. Given scores [0.9, 0.4, 0.7, 0.2, 0.5] and labels [1, 0, 1, 0, 1], compute AUC by counting correctly-ordered pairs.
  3. Calibration plot. Train a model on a small dataset, plot predicted probability vs actual frequency in 10 bins.
  4. Bootstrap a CI. For a model with 92% accuracy on 500 examples, compute a 95% CI on the accuracy via 1000 bootstrap resamples.

See also