demo

The math that trains every neural network

Drop a ball on a curve. Compute the slope. Step downhill. Repeat until you find the bottom. That's gradient descent — and that's every neural network on earth, learning.

The three numbers that matter

  • x — the parameter we're tuning. In a neural net, this is millions of weights at once. Here it's one number, so you can see what's happening.
  • f(x) — the loss. How wrong the model is at this setting. Lower is better.
  • ∇f(x) — the gradient. The slope of f at x. It points uphill; we go the opposite way (made concrete in the sketch below).
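
In code, the three numbers are just one value and two functions. A minimal sketch, assuming the demo's parabola f(x) = x²; the names f and grad are illustrative, not the demo's source:

```python
def f(x):
    return x ** 2        # the loss: how wrong the model is at parameter value x

def grad(x):
    return 2 * x         # the gradient: the slope of f at x (here f'(x) = 2x)

x = 2.0                  # the parameter we're tuning
print(f(x), grad(x))     # 4.0 4.0 -- positive slope, so downhill is to the left
```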

The update rule

x ← x − η · ∇f(x)

That's it. η (eta) is the learning rate — how big a step we take. Too big and we overshoot the minimum and bounce around. Too small and we crawl. Most of "tuning a neural network" is finding the right learning rate for your problem.
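
In code, the update rule is a single line inside a loop. A minimal sketch, again assuming f(x) = x²; η and the step count are hand-picked for illustration, not the demo's exact settings:

```python
def grad(x):
    return 2 * x                  # gradient of the demo parabola f(x) = x**2

x = 2.0                           # starting point
eta = 0.1                         # learning rate
for step in range(25):
    x = x - eta * grad(x)         # the update rule, verbatim
print(x)                          # each step multiplies x by 1 - 2*eta = 0.8, so x ≈ 0.0076
# with eta = 0.9 the factor is 1 - 2*0.9 = -0.8: x still shrinks,
# but flips sign every step (experiment 2 below)
```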

Things to try in interactive mode

  1. Start on the parabola y = x² with η = 0.1. Press play. The ball glides smoothly to x = 0 — the easy case.
  2. Crank η to 0.9 on the same parabola. The ball oscillates wildly across the minimum: on y = x², each step multiplies x by 1 − 2η = −0.8, so the ball flips sides every iteration while only slowly shrinking toward 0. This is what "learning rate too high" looks like in practice.
  3. Switch to y = ¼x⁴ − x², the double well. With x₀ = 2.6 the ball falls into the right basin; with x₀ = −2.6, the left. Same loss, two different "answers" depending on initialization: the local-minima problem in miniature (see the basin sketch after this list).
  4. Switch to the wavy function. With μ = 0, the ball gets stuck in a tiny local minimum. Crank momentum to 0.9 and the ball smashes through to the real minimum (see the momentum sketch after this list). That's why production optimizers (Adam, AdamW) use momentum.
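
The basin effect in experiment 3 is easy to reproduce outside the demo. A sketch with plain gradient descent and the analytic derivative x³ − 2x; η and the step count are illustrative. The two basins bottom out at x = ±√2, since x³ − 2x = x(x² − 2):

```python
def grad(x):
    return x ** 3 - 2 * x            # derivative of 0.25 * x**4 - x**2

def descend(x, eta=0.05, steps=500):
    for _ in range(steps):
        x = x - eta * grad(x)        # plain gradient descent
    return x

print(descend(2.6))     # ≈ +1.414: the right basin
print(descend(-2.6))    # ≈ -1.414: the left basin -- same loss, different answer
```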
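
Experiment 4's momentum update keeps a velocity that remembers past gradients. The demo's exact wavy function isn't given on this page, so the sketch below substitutes a stand-in with the same character (a bowl with ripples, so many shallow local minima); μ, η, x₀, and the step count are likewise illustrative:

```python
import math

def f(x):
    return 0.5 * x ** 2 + 0.5 * math.sin(10 * x)   # a bowl with ripples

def grad(x):
    return x + 5 * math.cos(10 * x)                 # its derivative

def descend(x, mu, eta=0.02, steps=400):
    v = 0.0
    for _ in range(steps):
        v = mu * v - eta * grad(x)   # momentum: velocity accumulates past gradients
        x = x + v                    # with mu = 0 this reduces to plain gradient descent
    return x

for mu in (0.0, 0.9):
    x = descend(4.0, mu)
    print(f"mu={mu}: x={x:.3f}, loss={f(x):.3f}")
# mu=0.0 stalls in the first ripple it meets (loss ≈ 5.9 here);
# mu=0.9 coasts through the ripples toward the bottom of the bowl (loss well below 1)
```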

Why this matters

A modern transformer has billions of parameters and a loss function defined over a corpus of trillions of tokens. The landscape is unimaginably high-dimensional. But the rule is the same as the one above. Every step of training is just a billion-dimensional version of "compute the gradient, take a step downhill". That's the entire learning algorithm.
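
And the jump to many dimensions changes nothing about the rule. A sketch with NumPy, using a toy quadratic loss over a million parameters (nothing here is a real transformer); the update line is identical, just vectorized:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=1_000_000)      # a million parameters instead of one

def grad(theta):
    return 2 * theta                    # gradient of the toy loss sum(theta**2)

eta = 0.1
for _ in range(100):
    theta -= eta * grad(theta)          # the same one-line update rule

print(np.abs(theta).max())              # every coordinate has marched toward 0
```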

Anchored to 01-math-foundations/calculus-and-optimization and 03-neural-networks/optimizers from the learning path.