The math that trains every neural network
Drop a ball on a curve. Compute the slope. Step downhill. Repeat until you find the bottom. That's gradient descent — and that's every neural network on earth, learning.
The three numbers that matter
- x — the parameter we're tuning. In a neural net, this is millions of weights at once. Here it's one number, so you can see what's happening.
- f(x) — the loss. How wrong the model is at this setting. Lower is better.
- ∇f(x) — the gradient. The slope of f at x. It points uphill; we go the opposite way.
The update rule
x ← x − η · ∇f(x)
That's it. η (eta) is the learning rate — how big a step we take. Too big and we overshoot the minimum and bounce around. Too small and we crawl. Most of "tuning a neural network" is finding the right learning rate for your problem.
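The update rule is small enough to sketch in a few lines. This is a minimal illustration on f(x) = x², where the gradient is 2x; the function, starting point, and learning rate are illustrative choices, not fixed by the demo.

```python
# Plain gradient descent on f(x) = x**2, whose gradient is 2x.
def grad_descent(x0, lr, steps):
    x = x0
    for _ in range(steps):
        grad = 2 * x       # gradient of x**2 at the current x
        x = x - lr * grad  # the update rule: x <- x - eta * grad
    return x

print(grad_descent(2.6, 0.1, 50))  # approaches the minimum at x = 0
```

With lr = 0.1 each step multiplies x by 0.8, so the ball slides smoothly toward zero; push lr past 1.0 here and the iterates grow instead of shrink.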
Things to try in interactive mode
- Start on the parabola y = x² with η = 0.1. Press play. The ball glides smoothly to x = 0 — the easy case.
- Crank η to 0.9 on the same parabola. The ball oscillates wildly across the minimum. This is what "learning rate too high" looks like in practice.
- Switch to y = ¼x⁴ − x² — the double well. With x₀ = 2.6, the ball falls into the right basin. With x₀ = −2.6, the left basin. Same loss, two different "answers" depending on initialization. This is local minima.
- Switch to the wavy function. With μ = 0, the ball gets stuck in a tiny local minimum. Crank momentum to 0.9 — the ball smashes through and finds the real minimum. That's why production optimizers (Adam, AdamW) use momentum.
Why this matters
A modern transformer has billions of parameters and a loss function defined over a corpus of trillions of tokens. The landscape is unimaginably high-dimensional. But the rule is the same as the one above. Every step of training is just a many- million-dimensional version of "compute the gradient, take a step downhill". That's the entire learning algorithm.
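The jump from one parameter to millions changes nothing about the rule itself — x becomes a vector and so does the gradient. A toy vectorized version, using the illustrative loss f(x) = ‖x‖²/2 (whose gradient is just x) rather than any real network's loss:

```python
import numpy as np

# The same update rule, vectorized over a million parameters.
# Toy loss f(x) = ||x||**2 / 2, whose gradient is simply x.
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)  # a million randomly initialized parameters
lr = 0.1

for _ in range(100):
    grad = x           # gradient of ||x||**2 / 2 at the current x
    x = x - lr * grad  # identical rule, applied to every parameter at once

print(float(np.abs(x).max()))  # every parameter is now close to 0
```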
Anchored to 01-math-foundations/calculus-and-optimization and 03-neural-networks/optimizers from the learning path.