demo
A model that says 80% should be right 80% of the time
A reliability diagram plots a model's stated confidence against its observed accuracy, showing how honest that confidence is. Most models are overconfident, and RLHF tends to make it worse. See how temperature scaling corrects it.
Anchored to 02-ml-fundamentals/evaluation-and-metrics
and 13-production/evaluation-and-benchmarks.
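
As a rough illustration of the idea, here is a minimal NumPy sketch of the pieces a reliability diagram needs, plus temperature scaling. It is not the demo's own code. The data is synthetic: labels are sampled from a calibrated softmax, and the "model" logits are then sharpened by an assumed factor of 3 to simulate overconfidence. All function names (softmax, reliability_bins, ece, nll, fit_temperature) are hypothetical, and the temperature is fitted by a simple grid search on held-out negative log-likelihood rather than a gradient-based optimizer.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Dividing logits by T > 1 flattens the distribution; T < 1 sharpens it.
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def reliability_bins(probs, labels, n_bins=10):
    # One (mean confidence, accuracy, weight) row per non-empty confidence bin.
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1])  # bin index in 0 .. n_bins - 1
    rows = []
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            rows.append((conf[mask].mean(), correct[mask].mean(), mask.mean()))
    return rows

def ece(probs, labels, n_bins=10):
    # Expected calibration error: weighted mean gap between accuracy and confidence.
    return sum(w * abs(acc - c) for c, acc, w in reliability_bins(probs, labels, n_bins))

def nll(probs, labels):
    # Mean negative log-likelihood of the true labels.
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Grid search for the single scalar T that minimizes held-out NLL.
    return min(grid, key=lambda t: nll(softmax(logits, t), labels))

# Synthetic data (an assumption for illustration, not real model output).
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(5000, 3))
# Gumbel-max trick: argmax(logits + Gumbel noise) samples from softmax(logits).
labels = (true_logits + rng.gumbel(size=true_logits.shape)).argmax(axis=1)
logits = 3.0 * true_logits  # sharpened logits simulate an overconfident model

# Fit T on one half, evaluate on the other.
T = fit_temperature(logits[:2500], labels[:2500])
ece_before = ece(softmax(logits[2500:]), labels[2500:])
ece_after = ece(softmax(logits[2500:], T), labels[2500:])

print(f"fitted temperature: {T:.2f}")
print(f"ECE before: {ece_before:.3f}  after: {ece_after:.3f}")
for c, acc, w in reliability_bins(softmax(logits[2500:]), labels[2500:]):
    print(f"confidence {c:.2f}  accuracy {acc:.2f}  share of samples {w:.2f}")
```

Plotting the (confidence, accuracy) rows against the diagonal gives the reliability diagram: points below the diagonal mean the model claims more confidence than its accuracy supports. Temperature scaling rescales every logit by the same constant, so it changes confidence but never which class is predicted, and accuracy stays the same.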