demo

There is no perfect threshold

Slide the decision cutoff. Watch precision, recall, F1, and the ROC curve respond in real time. Every classifier in production lives somewhere on this curve — and the right place depends on whether false positives or false negatives cost more.

The four metrics

  • Precision = TP / (TP + FP) — of predicted positives, how many were right? Care about this when false positives are expensive (spam filter labelling real email as spam).
  • Recall = TP / (TP + FN) — of actual positives, how many did we catch? Care about this when missing positives is expensive (cancer screening).
  • F1 = 2·P·R / (P + R) — harmonic mean. It sits near the lower of the two, so one weak metric drags the score down; useful when you want both decent.
  • Accuracy = correct / total — fine when classes are balanced, garbage when not (an "always predict no cancer" model on a 1% prevalence dataset is 99% accurate and useless — see the sketch below).
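
A minimal sketch of the four formulas on confusion-matrix counts — the counts here are invented for illustration, not taken from the demo:

```python
# Four metrics from raw confusion-matrix counts; counts invented for illustration.
tp, fp, fn, tn = 90, 10, 30, 870

precision = tp / (tp + fp)                                 # of predicted positives, how many were right
recall    = tp / (tp + fn)                                 # of actual positives, how many we caught
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy  = (tp + tn) / (tp + fp + fn + tn)                # trust only when classes are balanced
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# -> precision=0.90 recall=0.75 f1=0.82 accuracy=0.96

# The accuracy trap: "always predict no cancer" at 1% prevalence.
tp, fp, fn, tn = 0, 0, 10, 990
print((tp + tn) / (tp + fp + fn + tn))  # 0.99 accuracy, zero recall: useless
```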

Try this — predict before you click

  1. Set class separation to 2.0 (well-separated). Slide the threshold from 0 to 1. Predict: at threshold 0.5, both precision and recall hit ~95%. Push the threshold to 0.9 — precision approaches 100% (every flagged item is correct) but recall collapses (you missed most positives). Push it to 0.1 — recall hits 100% (you caught everything) but precision falls (lots of false alarms). The trade-off is mechanical: whatever one metric gains, the other pays for (see the sweep sketch after this list).
  2. Drop class separation to 0.7 (overlapping distributions). The histograms heavily overlap. Predict: AUC drops from ~0.99 to ~0.65 — there is no threshold that gets both metrics high. This is the "your features aren't separating your classes" failure.
  3. At separation 1.5, walk the threshold from 0 to 1 and watch the ROC dot trace the curve. Predict: the curve bows up toward the top-left corner. The "best" threshold (max F1) usually sits near where the dot comes closest to (0,1) — around 0.5 for balanced classes.
  4. Re-roll the seed several times at separation 1.0. Predict: the AUC fluctuates by ~0.05 even on the same setup — finite-sample noise (see the AUC sketch below). This is why you don't compare model A at AUC 0.852 to model B at AUC 0.853 without confidence intervals.
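
Steps 1–3 can be reproduced offline. A sketch under assumptions: scores are modelled as two unit-variance Gaussians whose means sit `sep` apart, squashed through a sigmoid so the threshold lives in (0, 1). The demo's slider almost certainly uses a different internal scaling, so the exact numbers won't match its readouts, but the shape of the trade will:

```python
# Threshold sweep over synthetic scores. `sep`, the sigmoid squash, and the
# sample size are assumptions standing in for the demo's internals.
import numpy as np

rng = np.random.default_rng(0)
n, sep = 5000, 2.0
neg = rng.normal(-sep / 2, 1.0, n)                       # raw scores, actual negatives
pos = rng.normal(+sep / 2, 1.0, n)                       # raw scores, actual positives
scores = 1 / (1 + np.exp(-np.concatenate([neg, pos])))   # squash into (0, 1)
labels = np.concatenate([np.zeros(n, bool), np.ones(n, bool)])

for t in (0.1, 0.5, 0.9):
    pred = scores >= t
    tp = np.sum(pred & labels)
    fp = np.sum(pred & ~labels)
    fn = np.sum(~pred & labels)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    print(f"t={t:.1f}  precision={precision:.3f}  recall={recall:.3f}  f1={f1:.3f}")
# High threshold: precision up, recall down. Low threshold: the reverse.
```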
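
And step 4's finite-sample noise, using the Mann–Whitney form of AUC (the probability that a random positive outscores a random negative). The sample size of 200 is a guess at the demo's scale, chosen small enough that the noise is visible:

```python
# AUC noise across seeds. Same Gaussian model as above; n=200 is an assumed
# sample size, not the demo's actual dataset size.
import numpy as np

def auc(pos, neg):
    # Mann–Whitney estimate: fraction of (positive, negative) pairs
    # where the positive example scores higher.
    return (pos[:, None] > neg[None, :]).mean()

n, sep = 200, 1.0
for seed in range(5):
    rng = np.random.default_rng(seed)
    pos = rng.normal(+sep / 2, 1.0, n)
    neg = rng.normal(-sep / 2, 1.0, n)
    print(f"seed={seed}  AUC={auc(pos, neg):.3f}")
# The spread across seeds is pure sampling noise: same distributions every time.
```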

Anchored to 02-ml-fundamentals/evaluation-and-metrics. Production take: /ship/13 — evaluation in production.