demo

A model that says 80% should be right 80% of the time

Reliability diagrams show how honest a model's confidence is by plotting stated confidence against observed accuracy. Most models are overconfident, and RLHF tends to make it worse. See how temperature scaling corrects it.
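
As a rough illustration of the idea, the sketch below bins predictions by confidence, compares each bin's mean confidence with its accuracy, and then fits a single temperature on a held-out split. Everything here is an assumption made for illustration: the data is synthetic (logits deliberately scaled by 3 to mimic overconfidence), the helper names `reliability_bins` and `fit_temperature` are hypothetical, and a plain grid search stands in for the gradient-based optimizer usually used to fit the temperature.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; T > 1 flattens the distribution.
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reliability_bins(probs, labels, n_bins=10):
    # Returns per-bin (mean confidence, accuracy, count) and the
    # expected calibration error (ECE) over equal-width bins.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1])  # bin index in 0..n_bins-1
    rows, ece = [], 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # skip empty bins
        c, a, n = conf[mask].mean(), correct[mask].mean(), int(mask.sum())
        rows.append((c, a, n))
        ece += (n / len(conf)) * abs(a - c)
    return rows, ece

def nll(logits, labels, T):
    # Mean negative log-likelihood of the true labels at temperature T.
    p = softmax(logits, T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Assumption: a simple grid search over T, minimizing validation NLL.
    return float(min(grid, key=lambda T: nll(logits, labels, T)))

# Synthetic setup (assumed, not real model output): labels are sampled
# from softmax(true_logits) via the Gumbel-max trick, then the logits
# the "model" reports are inflated to make it overconfident.
rng = np.random.default_rng(0)
n, k = 5000, 5
true_logits = rng.normal(size=(n, k)) * 2.0
labels = (true_logits + rng.gumbel(size=(n, k))).argmax(axis=1)
logits = true_logits * 3.0

# Fit T on the first half, evaluate calibration on the second half.
val, test = slice(0, n // 2), slice(n // 2, n)
T = fit_temperature(logits[val], labels[val])
_, ece_before = reliability_bins(softmax(logits[test]), labels[test])
rows_after, ece_after = reliability_bins(softmax(logits[test], T), labels[test])

print(f"fitted T = {T:.2f}")
print(f"ECE before = {ece_before:.3f}, after = {ece_after:.3f}")
for c, a, cnt in rows_after:
    print(f"confidence {c:.2f}  accuracy {a:.2f}  n={cnt}")
```

Dividing the logits by a single scalar never changes which class is ranked first, so accuracy stays the same and only the stated confidence moves. A fitted temperature above 1 indicates the model was overconfident.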

Anchored to 02-ml-fundamentals/evaluation-and-metrics and 13-production/evaluation-and-benchmarks.