The math that trains every neural network
Drop a ball on a curve. Compute the slope. Step downhill. Repeat until you find the bottom. That's gradient descent — and that's every neural network on earth, learning.
The three numbers that matter
- x — the parameter we're tuning. In a neural net, this is millions of weights at once. Here it's one number, so you can see what's happening.
- f(x) — the loss. How wrong the model is at this setting. Lower is better.
- ∇f(x) — the gradient. The slope of f at x. It points uphill; we go the opposite way.
The update rule
x ← x − η · ∇f(x)
That's it. η (eta) is the learning rate — how big a step we take. Too big and we overshoot the minimum and bounce around. Too small and we crawl. Most of "tuning a neural network" is finding the right learning rate for your problem.
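The update rule is small enough to sketch in a few lines. This is a minimal illustration on f(x) = x², where the gradient is 2x; the function, starting point, and learning rate are illustrative choices, not fixed by the demo.

```python
# Plain gradient descent on f(x) = x**2, whose gradient is 2x.
def grad_descent(x0, lr, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * x       # gradient of x**2 at the current x
        x = x - lr * grad  # the update rule: x <- x - eta * grad
    return x

print(grad_descent(2.6, 0.1, 50))  # approaches the minimum at x = 0
```

With lr = 0.1 each step multiplies x by 0.8, so the ball slides smoothly toward zero; push lr past 1.0 here and the iterates grow instead of shrink.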
Things to try in interactive mode
- Start on the parabola y = x² with η = 0.1. Press play. The ball glides smoothly to x = 0 — the easy case.
- Crank η to 0.9 on the same parabola. The ball oscillates wildly across the minimum. This is what "learning rate too high" looks like in practice.
- Switch to y = ¼x⁴ − x² — the double well. With x₀ = 2.6, the ball falls into the right basin. With x₀ = −2.6, the left basin. Same loss, two different "answers" depending on initialization. This is local minima.
- Switch to the wavy function. With μ = 0, the ball gets stuck in a tiny local minimum. Crank momentum to 0.9 — the ball smashes through and finds the real minimum. That's why production optimizers (Adam, AdamW) use momentum.
Why this matters
A modern transformer has billions of parameters and a loss function defined over a corpus of trillions of tokens. The landscape is unimaginably high-dimensional. But the rule is the same as the one above. Every step of training is just a many- million-dimensional version of "compute the gradient, take a step downhill". That's the entire learning algorithm.
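The jump from one parameter to millions changes nothing about the rule itself — x becomes a vector and so does the gradient. A toy vectorized version, using the illustrative loss f(x) = ‖x‖²/2 (whose gradient is just x) rather than any real network's loss:

```python
import numpy as np

# The same update rule, vectorized over a million parameters.
# Toy loss f(x) = ||x||**2 / 2, whose gradient is simply x.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # a million randomly initialized parameters
lr = 0.1

for _ in range(100):
    grad = x           # gradient of ||x||**2 / 2 at the current x
    x = x - lr * grad  # identical rule, applied to every parameter at once

print(float(np.abs(x).max()))  # every parameter is now close to 0
```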
Anchored to 01-math-foundations/calculus-and-optimization and 03-neural-networks/optimizers from the learning path.