demo

There is no perfect threshold

Slide the decision cutoff. Watch precision, recall, F1, and the ROC curve respond in real time. Every classifier in production lives somewhere on this curve — and the right place depends on whether false positives or false negatives cost more.

The four metrics

  • Precision = TP / (TP + FP) — of predicted positives, how many were right? Care about this when false positives are expensive (spam filter labelling real email as spam).
  • Recall = TP / (TP + FN) — of actual positives, how many did we catch? Care about this when missing positives is expensive (cancer screening).
  • F1 = 2·P·R / (P + R) — harmonic mean. It sits near the lower of the two, so one weak metric drags the score down; useful when you want both decent.
  • Accuracy = correct / total — fine when classes are balanced, garbage when not (an "always predict no cancer" model on a 1% prevalence dataset is 99% accurate and useless — see the sketch below).
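
A minimal sketch of the four formulas on confusion-matrix counts — the counts here are invented for illustration, not taken from the demo:

```python
# Four metrics from raw confusion-matrix counts; counts invented for illustration.
tp, fp, fn, tn = 90, 10, 30, 870

precision = tp / (tp + fp)                                 # of predicted positives, how many were right
recall    = tp / (tp + fn)                                 # of actual positives, how many we caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy  = (tp + tn) / (tp + fp + fn + tn)                # trust only when classes are balanced
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# -> precision=0.90 recall=0.75 f1=0.82 accuracy=0.96

# The accuracy trap: "always predict no cancer" at 1% prevalence.
tp, fp, fn, tn = 0, 0, 10, 990
print((tp + tn) / (tp + fp + fn + tn))  # 0.99 accuracy, zero recall: useless
```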

Try this — predict before you click

  1. Set class separation to 2.0 (well-separated). Slide the threshold from 0 to 1. Predict: at threshold 0.5, both precision and recall hit ~95%. Push the threshold to 0.9 — precision approaches 100% (every flagged item is correct) but recall collapses (you missed most positives). Push it to 0.1 — recall hits 100% (you caught everything) but precision falls (lots of false alarms). The trade-off is mechanical: whatever one metric gains, the other pays for (see the sweep sketch after this list).
  2. Drop class separation to 0.7 (overlapping distributions). The histograms heavily overlap. Predict: AUC drops from ~0.99 to ~0.65 — there is no threshold that gets both metrics high. This is the "your features aren't separating your classes" failure.
  3. At separation 1.5, walk the threshold from 0 to 1 and watch the ROC dot trace the curve. Predict: the curve bows up toward the top-left corner. The "best" threshold (max F1) usually sits near where the dot comes closest to (0,1) — around 0.5 for balanced classes.
  4. Re-roll the seed several times at separation 1.0. Predict: the AUC fluctuates by ~0.05 even on the same setup — finite-sample noise (see the AUC sketch below). This is why you don't compare model A at AUC 0.852 to model B at AUC 0.853 without confidence intervals.
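
Steps 1–3 can be reproduced offline. A sketch under assumptions: scores are modelled as two unit-variance Gaussians whose means sit `sep` apart, squashed through a sigmoid so the threshold lives in (0, 1). The demo's slider almost certainly uses a different internal scaling, so the exact numbers won't match its readouts, but the shape of the trade will:

```python
# Threshold sweep over synthetic scores. `sep`, the sigmoid squash, and the
# sample size are assumptions standing in for the demo's internals.
import numpy as np

rng = np.random.default_rng(0)
n, sep = 5000, 2.0
neg = rng.normal(-sep / 2, 1.0, n)                       # raw scores, actual negatives
pos = rng.normal(+sep / 2, 1.0, n)                       # raw scores, actual positives
scores = 1 / (1 + np.exp(-np.concatenate([neg, pos])))   # squash into (0, 1)
labels = np.concatenate([np.zeros(n, bool), np.ones(n, bool)])

for t in (0.1, 0.5, 0.9):
    pred = scores >= t
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"t={t:.1f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
# High threshold: precision up, recall down. Low threshold: the reverse.
```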
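
And step 4's finite-sample noise, using the Mann–Whitney form of AUC (the probability that a random positive outscores a random negative). The sample size of 200 is a guess at the demo's scale, chosen small enough that the noise is visible:

```python
# AUC noise across seeds. Same Gaussian model as above; n=200 is an assumed
# sample size, not the demo's actual dataset size.
import numpy as np

def auc(pos, neg):
    # Mann–Whitney estimate: fraction of (positive, negative) pairs
    # where the positive example scores higher.
    return (pos[:, None] > neg[None, :]).mean()

n, sep = 200, 1.0
for seed in range(5):
    rng = np.random.default_rng(seed)
    pos = rng.normal(+sep / 2, 1.0, n)
    neg = rng.normal(-sep / 2, 1.0, n)
    print(f"seed={seed}  AUC={auc(pos, neg):.3f}")
# The spread across seeds is pure sampling noise: same distributions every time.
```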

Anchored to 02-ml-fundamentals/evaluation-and-metrics. Production take: /ship/13 — evaluation in production.