Calculus & Optimization
ML is mostly: define a loss, take the gradient, step downhill, repeat. Calculus tells you what “downhill” means; optimization theory tells you how to step.
Derivatives
The derivative f'(x) tells you how fast f changes as x changes. Geometrically, the slope of the tangent line.
f'(x) = lim_{h→0} (f(x+h) − f(x)) / h
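A quick numerical check of this definition (the function and evaluation point are arbitrary choices):

```python
# Approximate f'(x) from the limit definition with a small finite h.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0, since d/dx x² = 2x
```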
Key derivatives to memorize:
| f(x) | f'(x) |
|---|---|
| c (constant) | 0 |
| x | 1 |
| x^n | n · x^(n−1) |
| e^x | e^x |
| log(x) | 1/x |
| sin(x) | cos(x) |
Rules
- Sum: (f + g)' = f' + g'
- Product: (f · g)' = f' · g + f · g'
- Quotient: (f / g)' = (f' g − f g') / g²
- Chain: (f(g(x)))' = f'(g(x)) · g'(x) ← the most important one for ML
The chain rule is how backpropagation works. Master it.
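A sanity check in code, with an arbitrary composition chosen for illustration: for h(x) = sin(x²), the chain rule gives h'(x) = cos(x²) · 2x.

```python
import numpy as np

x, h = 1.5, 1e-6
analytic = np.cos(x**2) * 2 * x                    # chain rule result
numeric = (np.sin((x + h)**2) - np.sin(x**2)) / h  # finite difference
print(analytic, numeric)                           # agree to ~6 decimal places
```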
Partial derivatives
For a multivariable function f(x, y), the partial derivative ∂f/∂x is the derivative with respect to x, treating y as constant.
f(x, y) = x² + xy + y²
∂f/∂x = 2x + y
∂f/∂y = x + 2y
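You can check these by holding one variable fixed and differencing the other; a minimal sketch:

```python
# Finite-difference check of the partials of f(x, y) = x² + xy + y².
def f(x, y):
    return x**2 + x*y + y**2

x, y, h = 1.0, 2.0, 1e-6
print((f(x + h, y) - f(x, y)) / h)  # ≈ 2x + y = 4
print((f(x, y + h) - f(x, y)) / h)  # ≈ x + 2y = 5
```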
The gradient
The gradient ∇f(x) of a scalar function over a vector input is the vector of partial derivatives:
∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
Properties:
- Points in the direction of steepest ascent
- Its length is the rate of increase in that direction
- Negated, it points downhill — the direction we move parameters
For a neural network with millions of parameters, the gradient is a million-dimensional vector. Computing it efficiently is what backprop does.
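A toy check of these properties, reusing f(x, y) = x² + xy + y²: stepping against the gradient lowers f, stepping along it raises f.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + x*y + y**2

def grad_f(p):
    x, y = p
    return np.array([2*x + y, x + 2*y])   # vector of partials

p = np.array([1.0, 2.0])
step = 0.01 * grad_f(p) / np.linalg.norm(grad_f(p))  # small unit-length step
print(f(p - step) < f(p) < f(p + step))  # True: downhill < start < uphill
```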
Jacobian and Hessian
- Jacobian — matrix of first-order partials when both input and output are vector-valued. Shape `(output_dim, input_dim)`.
- Hessian — matrix of second-order partials of a scalar function. Shape `(input_dim, input_dim)`. Tells you curvature.
In modern ML you rarely compute the full Hessian (too expensive), but the concept underlies second-order optimizers like L-BFGS and Newton’s method.
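For small problems you can still form it numerically; a finite-difference sketch on the toy function above (the helper is illustrative, and happens to be exact for quadratics):

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + x*y + y**2

def hessian(f, p, h=1e-5):
    # H[i, j] ≈ ∂²f/∂xᵢ∂xⱼ via forward differences.
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(p + ei + ej) - f(p + ei) - f(p + ej) + f(p)) / h**2
    return H

print(hessian(f, np.array([1.0, 2.0])))  # ≈ [[2, 1], [1, 2]]
```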
Gradient descent
The fundamental algorithm.
θ_{t+1} = θ_t − η · ∇L(θ_t)
- θ — parameters (weights)
- L(θ) — loss as a function of parameters
- η — learning rate (a positive number)
- ∇L(θ) — gradient of the loss
You step in the direction that decreases the loss fastest. Repeat until the loss stops dropping.
```python
import numpy as np

def gradient_descent(grad_fn, theta_init, lr=0.01, steps=1000):
    """Minimize a loss given its gradient function by repeated downhill steps."""
    theta = theta_init
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)  # θ ← θ − η · ∇L(θ)
    return theta
```
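Usage on a toy problem: minimize f(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the target 3 is arbitrary).

```python
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta_init=0.0, lr=0.1)
print(theta_star)  # ≈ 3.0
```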
Pitfall: if the learning rate is too high, you overshoot and diverge. If too low, you crawl. Tuning it is the most common debug step.
Stochastic gradient descent (SGD)
Computing ∇L on the whole training set is expensive. Instead, sample a mini-batch, compute the gradient on it, step.
θ ← θ − η · ∇L_batch(θ)
The gradient on a batch is a noisy estimate of the true gradient. The noise:
- Slows exact convergence (the iterates jitter around the minimum)
- Helps escape sharp minima (a feature, not a bug)
- Lets you train models that don’t fit in memory
Batch size is a knob. Tiny batches give noisier gradients, which can act as a regularizer. Huge batches give smoother gradients but cost more per step and may need higher learning rates.
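A minimal sketch of the mini-batch loop on synthetic linear regression (the data, batch size, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # synthetic inputs
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

w, lr, batch_size = np.zeros(5), 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of batch MSE
    w -= lr * grad
print(np.linalg.norm(w - w_true))  # small: w has recovered w_true
```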
Momentum
Vanilla SGD bounces around in narrow valleys. Momentum smooths the trajectory by accumulating a velocity:
v ← β · v + ∇L(θ)
θ ← θ − η · v
β ≈ 0.9 typically. The velocity carries you through small bumps.
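A sketch on the elongated bowl f(x, y) = x² + 10y², with typical (untuned) values of β and η:

```python
import numpy as np

theta, v = np.array([10.0, 1.0]), np.zeros(2)
beta, lr = 0.9, 0.01
for _ in range(300):
    grad = np.array([2 * theta[0], 20 * theta[1]])  # ∇f for x² + 10y²
    v = beta * v + grad        # accumulate velocity
    theta = theta - lr * v     # step along the smoothed direction
print(theta)  # → near (0, 0)
```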
Adam, AdamW
The default optimizer for neural networks. Maintains:
- A momentum-like running average of gradients (first moment)
- A running average of squared gradients (second moment, for per-parameter learning rates)
m_t = β₁ m_{t-1} + (1 − β₁) g_t
v_t = β₂ v_{t-1} + (1 − β₂) g_t²
θ_t ← θ_{t-1} − η · m_t / (√v_t + ε)
AdamW decouples weight decay from the gradient update — this matters for transformers and is the modern default.
Typical hyperparameters: β₁=0.9, β₂=0.95 or 0.999, η=1e-4 to 3e-4 for transformers.
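A from-scratch sketch on the toy quadratic f(θ) = θ². Note one detail the equations above omit for brevity: the full algorithm bias-corrects m_t and v_t early in training (dividing by 1 − βᵗ).

```python
import numpy as np

theta, m, v = 5.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = 2 * theta                        # gradient of θ²
    m = beta1 * m + (1 - beta1) * g      # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g**2   # second moment (squared grads)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # close to 0
```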
Learning rate schedules
The learning rate isn’t usually constant.
- Warmup: linearly ramp from 0 to peak over the first few thousand steps. Critical for transformers.
- Cosine decay: smooth ramp-down from peak to a small fraction.
- Step decay: drop by 10× at fixed milestones. Old-school, still works.
- One-cycle: warmup, then linear decay. Popular with super-convergence.
A common modern recipe: linear warmup + cosine decay to ~10% of peak.
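A sketch of that recipe (the peak learning rate, warmup length, and horizon below are arbitrary examples):

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=100_000, floor=0.1):
    if step < warmup:
        return peak * step / warmup                    # linear warmup: 0 → peak
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # smooth 1 → 0
    return peak * (floor + (1 - floor) * cosine)       # decay to 10% of peak

print(lr_at(0), lr_at(2000), lr_at(100_000))  # 0.0 → peak → 0.1 · peak
```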
Convex vs non-convex
A function is convex if every line segment between two points on its graph lies on or above the graph. Convex functions have no spurious local minima: every local minimum is a global minimum, and gradient descent (with a sensible step size) finds one.
Neural network losses are non-convex. They have many local minima, saddle points, and flat regions. Gradient descent doesn’t guarantee global optimality.
In practice: deep learning works anyway. Most local minima are good enough; saddle points are escapable; over-parameterization helps.
Saddle points and flat minima
In high dimensions, saddle points (gradient zero, some directions go up, some down) are far more common than local minima. Optimizers need to navigate them, not get stuck.
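A concrete toy: f(x, y) = x² − y² has a saddle at the origin. Started exactly on the ridge, gradient descent walks straight into the saddle; the slightest off-axis perturbation lets it escape.

```python
import numpy as np

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])  # ∇(x² − y²)

for y0 in (0.0, 1e-3):                       # on the ridge vs slightly off
    p = np.array([1.0, y0])
    for _ in range(100):
        p = p - 0.05 * grad(p)
    print(y0, p)  # y0 = 0 stalls at the saddle; y0 = 1e-3 escapes along y
```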
Flat minima sit in wider basins and tend, empirically, to generalize better than sharp ones. Some optimization tricks explicitly target flatness, e.g. SAM (Sharpness-Aware Minimization).
Vanishing and exploding gradients
Through deep networks, gradients can shrink to zero or blow up to infinity. Mitigations:
- Better activations (ReLU instead of sigmoid)
- Residual connections (let gradients flow around layers)
- Layer/batch normalization
- Gradient clipping (cap the global norm)
- Careful initialization (Xavier, He)
We’ll see all of these in Stage 03.
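The vanishing case is easy to see with sigmoids, whose derivative never exceeds 0.25; a toy chain of 30 of them (no weights, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x, grad = 0.5, 1.0
for _ in range(30):
    s = sigmoid(x)
    grad *= s * (1 - s)   # chain rule: multiply by each layer's local slope
    x = s
print(grad)  # tiny (~1e-20): the gradient has effectively vanished
```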
Backpropagation in one paragraph
Compute the loss in a forward pass. Then, for each parameter, apply the chain rule starting at the loss and walking backward through the computation graph. Each layer’s contribution is a local Jacobian-vector product. Modern frameworks build the graph automatically: in PyTorch you write loss.backward() and param.grad is populated for every parameter; JAX exposes the same machinery through jax.grad.
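In PyTorch that looks like this (toy scalar example):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (w * x - 1.0) ** 2   # forward pass builds the graph
loss.backward()             # backward pass applies the chain rule
print(w.grad)               # dL/dw = 2(wx − 1) · x = 30
```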
Practical optimization advice
- Start with AdamW. Don’t tune the optimizer first.
- Learning rate is the most important hyperparameter. Sweep it before anything else.
- Use warmup for anything attention-based. No exceptions.
- Clip gradients at norm 1.0 if training is unstable.
- Watch loss and gradient norms; if the gradient norm explodes, something is wrong. (Both are sketched after this list.)
- Mixed precision (fp16/bf16) is essentially free speedup — turn it on.
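A sketch of the clipping-and-watching advice in PyTorch (the model, data, and loop are dummies for illustration):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(10):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    # Rescales gradients so the global norm is ≤ 1.0 and returns the
    # pre-clip norm, which is the number worth logging every step.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    print(step, round(loss.item(), 4), round(grad_norm.item(), 4))
```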
Exercises
- Derive backprop. For `y = w₂ · σ(w₁ · x + b₁) + b₂` where `σ` is sigmoid, derive `∂L/∂w₁` for `L = (y − target)²`.
- Plot GD trajectories. Minimize `f(x, y) = x² + 10y²` (an elongated bowl) with vanilla SGD and with momentum. Compare paths.
- Tune the learning rate. Train an MLP on MNIST with `η = 1e-1, 1e-2, 1e-3, 1e-4, 1e-5`. Plot loss curves. Where does training fail? Where is it slow?
- Implement Adam. From scratch in NumPy on a toy quadratic. Verify it converges faster than SGD.
See also
- Probability & statistics — losses are expectations
- Information theory — what we minimize
- Stage 03 — Backpropagation
- Stage 03 — Optimizers