demo
Underfit, fit, overfit — pick a degree
Fit a polynomial to noisy data. Slide the degree from 1 to 15. Watch a straight line miss the curve, a degree-3 fit nail it, and a degree-15 fit wiggle through every noise point. The most important plot in ML.
The lesson
Train MSE always falls as the model gets more complex. Test MSE traces a U-shape: high when the model is too simple (bias dominates), low in the middle, high again when it is too complex (variance dominates). The sweet spot is the bottom of the test-MSE U, the point just before the train-test gap starts to widen.
The math
# closed-form polynomial fit (this is what runs):
β̂ = (XᵀX)⁻¹ Xᵀ y # X has columns [1, x, x², ..., x^d]
ŷ(x) = β̂₀ + β̂₁·x + β̂₂·x² + ... + β̂_d·x^d
# train MSE / test MSE:
MSE = (1/n) Σᵢ (ŷᵢ − yᵢ)²
Try this — predict before you click
- Set noise = 0.3, degree = 1. Predict: a straight line misses the sin curve completely. Train and test MSE both high — that's underfitting. Bias dominates.
- Slide degree to 3. Predict: train and test MSE both drop sharply and stay close. Bias and variance balanced — sweet spot.
- Slide degree to 12+. Predict: train MSE keeps dropping but test MSE shoots up. The curve wiggles through every train point but misses test points by a lot. Pure variance — the model is memorizing noise.
- Set noise to 0.0 and re-roll the seed. Predict: with no noise, even degree 15 barely overfits, because there is nothing spurious to memorize. In this demo, overfitting is essentially a noise problem: extra capacity only hurts when there is noise to fit.
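The sweep above can be reproduced offline. Here is a minimal NumPy sketch of the closed-form fit and the train/test MSE comparison; the sample size, even/odd split, and seed are illustrative assumptions, not the demo's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy sine data, mirroring the demo (noise = 0.3)
x = np.sort(rng.uniform(0.0, 2 * np.pi, 60))
y = np.sin(x) + rng.normal(0, 0.3, x.size)
x_tr, y_tr = x[::2], y[::2]    # even indices -> train
x_te, y_te = x[1::2], y[1::2]  # odd indices  -> test

def poly_fit(x, y, d):
    """Closed-form least squares: beta = (X^T X)^(-1) X^T y."""
    X = np.vander(x, d + 1, increasing=True)      # columns [1, x, ..., x^d]
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # SVD solve, stabler than an explicit inverse
    return beta

def mse(beta, x, y):
    X = np.vander(x, len(beta), increasing=True)
    return np.mean((X @ beta - y) ** 2)

for d in (1, 3, 15):
    b = poly_fit(x_tr, y_tr, d)
    print(f"degree {d:2d}  train MSE {mse(b, x_tr, y_tr):.3f}  test MSE {mse(b, x_te, y_te):.3f}")
```

Sweeping `d` from 1 to 15 and plotting both MSEs reproduces the U-shaped test curve described above.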
How it scales to neural networks
Modern transformers are massively overparameterized — they could memorize their training data. They mostly don't, because of regularization (dropout, weight decay), early stopping, data augmentation, and the implicit regularization of SGD itself. Understanding this curve is the prerequisite for understanding every regularization technique.
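Weight decay, mentioned above, reduces in the linear case to ridge regression: penalizing ‖β‖² adds λI inside the closed-form solution, which shrinks the wild coefficients of a high-degree fit. A sketch under assumed data and λ values (not from the demo):

```python
import numpy as np

rng = np.random.default_rng(0)

# Noisy sine on [0, 1] to keep the degree-15 Vandermonde numerically tame
x = np.sort(rng.uniform(0.0, 1.0, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, x.size)

X = np.vander(x, 16, increasing=True)  # columns [1, x, ..., x^15]

def ridge(X, y, lam):
    """Ridge / weight decay: beta = (X^T X + lam*I)^(-1) X^T y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

for lam in (1e-6, 1e-2, 1.0):
    b = ridge(X, y, lam)
    print(f"lambda={lam:g}  coefficient norm ||beta|| = {np.linalg.norm(b):.2f}")
```

Larger λ trades a little extra bias for much lower variance, flattening the right arm of the U; that is the same trade every regularization technique makes.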
Anchored to 02-ml-fundamentals/regularization-and-generalization.