Calculus & Optimization
ML is mostly: define a loss, take the gradient, step downhill, repeat. Calculus tells you what “downhill” means; optimization theory tells you how to step.
Derivatives
The derivative f'(x) tells you how fast f changes as x changes. Geometrically, the slope of the tangent line.
f'(x) = lim_{h→0} (f(x+h) − f(x)) / h
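A quick numerical check of this definition (the function and evaluation point are arbitrary choices):

```python
# Approximate f'(x) from the limit definition with a small finite h.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

print(numerical_derivative(lambda x: x**2, 3.0))  # ≈ 6.0, since d/dx x² = 2x
```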
Key derivatives to memorize:
| f(x) | f'(x) |
|---|---|
| c (constant) | 0 |
| x | 1 |
| x^n | n · x^(n−1) |
| e^x | e^x |
| log(x) | 1/x |
| sin(x) | cos(x) |
Rules
- Sum: (f + g)' = f' + g'
- Product: (f · g)' = f' · g + f · g'
- Quotient: (f / g)' = (f' g − f g') / g²
- Chain: (f(g(x)))' = f'(g(x)) · g'(x) ← the most important one for ML
The chain rule is how backpropagation works. Master it.
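A sanity check in code, with an arbitrary composition chosen for illustration: for h(x) = sin(x²), the chain rule gives h'(x) = cos(x²) · 2x.

```python
import numpy as np

x, h = 1.5, 1e-6
analytic = np.cos(x**2) * 2 * x                    # chain rule result
numeric = (np.sin((x + h)**2) - np.sin(x**2)) / h  # finite difference
print(analytic, numeric)                           # agree to ~6 decimal places
```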
Partial derivatives
For a multivariable function f(x, y), the partial derivative ∂f/∂x is the derivative with respect to x, treating y as constant.
f(x, y) = x² + xy + y²
∂f/∂x = 2x + y
∂f/∂y = x + 2y
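You can check these by holding one variable fixed and differencing the other; a minimal sketch:

```python
# Finite-difference check of the partials of f(x, y) = x² + xy + y².
def f(x, y):
    return x**2 + x*y + y**2

x, y, h = 1.0, 2.0, 1e-6
print((f(x + h, y) - f(x, y)) / h)  # ≈ 2x + y = 4
print((f(x, y + h) - f(x, y)) / h)  # ≈ x + 2y = 5
```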
The gradient
The gradient ∇f(x) of a scalar function over a vector input is the vector of partial derivatives:
∇f(x) = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]
Properties:
- Points in the direction of steepest ascent
- Its length is the rate of increase in that direction
- Negated, it points downhill — the direction we move parameters
For a neural network with millions of parameters, the gradient is a million-dimensional vector. Computing it efficiently is what backprop does.
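A toy check of these properties, reusing f(x, y) = x² + xy + y²: stepping against the gradient lowers f, stepping along it raises f.

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + x*y + y**2

def grad_f(p):
    x, y = p
    return np.array([2*x + y, x + 2*y])   # vector of partials

p = np.array([1.0, 2.0])
step = 0.01 * grad_f(p) / np.linalg.norm(grad_f(p))  # small unit-length step
print(f(p - step) < f(p) < f(p + step))  # True: downhill < start < uphill
```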
Jacobian and Hessian
- Jacobian — matrix of first-order partials when both input and output are vector-valued. Shape `(output_dim, input_dim)`.
- Hessian — matrix of second-order partials of a scalar function. Shape `(input_dim, input_dim)`. Tells you curvature.
In modern ML you rarely compute the full Hessian (too expensive), but the concept underlies second-order optimizers like L-BFGS and Newton’s method.
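For small problems you can still form it numerically; a finite-difference sketch on the toy function above (the helper is illustrative, and happens to be exact for quadratics):

```python
import numpy as np

def f(p):
    x, y = p
    return x**2 + x*y + y**2

def hessian(f, p, h=1e-5):
    # H[i, j] ≈ ∂²f/∂xᵢ∂xⱼ via forward differences.
    n = len(p)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei, ej = np.eye(n)[i] * h, np.eye(n)[j] * h
            H[i, j] = (f(p + ei + ej) - f(p + ei) - f(p + ej) + f(p)) / h**2
    return H

print(hessian(f, np.array([1.0, 2.0])))  # ≈ [[2, 1], [1, 2]]
```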
Gradient descent
The fundamental algorithm.
θ_{t+1} = θ_t − η · ∇L(θ_t)
- θ — parameters (weights)
- L(θ) — loss as a function of parameters
- η — learning rate (a positive number)
- ∇L(θ) — gradient of the loss
You step in the direction that decreases the loss fastest. Repeat until the loss stops dropping.
```python
import numpy as np

def gradient_descent(grad_fn, theta_init, lr=0.01, steps=1000):
    """Minimize a loss given its gradient function by repeated downhill steps."""
    theta = theta_init
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)  # θ ← θ − η · ∇L(θ)
    return theta
```
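Usage on a toy problem: minimize f(θ) = (θ − 3)², whose gradient is 2(θ − 3) (the target 3 is arbitrary).

```python
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta_init=0.0, lr=0.1)
print(theta_star)  # ≈ 3.0
```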
Pitfall: if the learning rate is too high, you overshoot and diverge. If too low, you crawl. Tuning it is the most common debug step.
Stochastic gradient descent (SGD)
Computing ∇L on the whole training set is expensive. Instead, sample a mini-batch, compute the gradient on it, step.
θ ← θ − η · ∇L_batch(θ)
The gradient on a batch is a noisy estimate of the true gradient. The noise:
- Slows exact convergence (the iterates jitter around the minimum)
- Helps escape sharp minima (a feature, not a bug)
- Lets you train models that don’t fit in memory
Batch size is a knob. Tiny batches give noisier gradients, which can act as a regularizer. Huge batches give smoother gradients but cost more per step and may need higher learning rates.
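A minimal sketch of the mini-batch loop on synthetic linear regression (the data, batch size, and learning rate are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                 # synthetic inputs
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)   # noisy targets

w, lr, batch_size = np.zeros(5), 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a mini-batch
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch_size    # gradient of batch MSE
    w -= lr * grad
print(np.linalg.norm(w - w_true))  # small: w has recovered w_true
```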
Momentum
Vanilla SGD bounces around in narrow valleys. Momentum smooths the trajectory by accumulating a velocity:
v ← β · v + ∇L(θ)
θ ← θ − η · v
β ≈ 0.9 typically. The velocity carries you through small bumps.
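A sketch on the elongated bowl f(x, y) = x² + 10y², with typical (untuned) values of β and η:

```python
import numpy as np

theta, v = np.array([10.0, 1.0]), np.zeros(2)
beta, lr = 0.9, 0.01
for _ in range(300):
    grad = np.array([2 * theta[0], 20 * theta[1]])  # ∇f for x² + 10y²
    v = beta * v + grad        # accumulate velocity
    theta = theta - lr * v     # step along the smoothed direction
print(theta)  # → near (0, 0)
```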
Adam, AdamW
The default optimizer for neural networks. Maintains:
- A momentum-like running average of gradients (first moment)
- A running average of squared gradients (second moment, for per-parameter learning rates)
m_t = β₁ m_{t-1} + (1 − β₁) g_t
v_t = β₂ v_{t-1} + (1 − β₂) g_t²
θ_t ← θ_{t-1} − η · m_t / (√v_t + ε)
AdamW decouples weight decay from the gradient update — this matters for transformers and is the modern default.
Typical hyperparameters: β₁=0.9, β₂=0.95 or 0.999, η=1e-4 to 3e-4 for transformers.
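A from-scratch sketch on the toy quadratic f(θ) = θ². Note one detail the equations above omit for brevity: the full algorithm bias-corrects m_t and v_t early in training (dividing by 1 − βᵗ).

```python
import numpy as np

theta, m, v = 5.0, 0.0, 0.0
beta1, beta2, lr, eps = 0.9, 0.999, 0.1, 1e-8
for t in range(1, 201):
    g = 2 * theta                        # gradient of θ²
    m = beta1 * m + (1 - beta1) * g      # first moment (momentum-like)
    v = beta2 * v + (1 - beta2) * g**2   # second moment (squared grads)
    m_hat = m / (1 - beta1**t)           # bias correction
    v_hat = v / (1 - beta2**t)
    theta -= lr * m_hat / (np.sqrt(v_hat) + eps)
print(theta)  # close to 0
```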
Learning rate schedules
The learning rate isn’t usually constant.
- Warmup: linearly ramp from 0 to peak over the first few thousand steps. Critical for transformers.
- Cosine decay: smooth ramp-down from peak to a small fraction.
- Step decay: drop by 10× at fixed milestones. Old-school, still works.
- One-cycle: warmup, then linear decay. Popular with super-convergence.
A common modern recipe: linear warmup + cosine decay to ~10% of peak.
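A sketch of that recipe (the peak learning rate, warmup length, and horizon below are arbitrary examples):

```python
import math

def lr_at(step, peak=3e-4, warmup=2000, total=100_000, floor=0.1):
    if step < warmup:
        return peak * step / warmup                    # linear warmup: 0 → peak
    progress = (step - warmup) / (total - warmup)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))  # smooth 1 → 0
    return peak * (floor + (1 - floor) * cosine)       # decay to 10% of peak

print(lr_at(0), lr_at(2000), lr_at(100_000))  # 0.0 → peak → 0.1 · peak
```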
Convex vs non-convex
A function is convex if every line segment between two points on its graph lies on or above the graph. Convex functions have no spurious local minima: every local minimum is a global minimum, and gradient descent (with a sensible step size) finds one.
Neural network losses are non-convex. They have many local minima, saddle points, and flat regions. Gradient descent doesn’t guarantee global optimality.
In practice: deep learning works anyway. Most local minima are good enough; saddle points are escapable; over-parameterization helps.
Saddle points and flat minima
In high dimensions, saddle points (gradient zero, some directions go up, some down) are far more common than local minima. Optimizers need to navigate them, not get stuck.
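A concrete toy: f(x, y) = x² − y² has a saddle at the origin. Started exactly on the ridge, gradient descent walks straight into the saddle; the slightest off-axis perturbation lets it escape.

```python
import numpy as np

def grad(p):
    return np.array([2 * p[0], -2 * p[1]])  # ∇(x² − y²)

for y0 in (0.0, 1e-3):                       # on the ridge vs slightly off
    p = np.array([1.0, y0])
    for _ in range(100):
        p = p - 0.05 * grad(p)
    print(y0, p)  # y0 = 0 stalls at the saddle; y0 = 1e-3 escapes along y
```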
Flat minima sit in wider basins and tend, empirically, to generalize better than sharp ones. Some optimization tricks explicitly target flatness, e.g. SAM (Sharpness-Aware Minimization).
Vanishing and exploding gradients
Through deep networks, gradients can shrink to zero or blow up to infinity. Mitigations:
- Better activations (ReLU instead of sigmoid)
- Residual connections (let gradients flow around layers)
- Layer/batch normalization
- Gradient clipping (cap the global norm)
- Careful initialization (Xavier, He)
We’ll see all of these in Stage 03.
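The vanishing case is easy to see with sigmoids, whose derivative never exceeds 0.25; a toy chain of 30 of them (no weights, purely for illustration):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x, grad = 0.5, 1.0
for _ in range(30):
    s = sigmoid(x)
    grad *= s * (1 - s)   # chain rule: multiply by each layer's local slope
    x = s
print(grad)  # tiny (~1e-20): the gradient has effectively vanished
```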
Backpropagation in one paragraph
Compute the loss in a forward pass. Then, for each parameter, apply the chain rule starting at the loss and walking backward through the computation graph. Each layer’s contribution is a local Jacobian-vector product. Modern frameworks build the graph automatically: in PyTorch you write loss.backward() and param.grad is populated for every parameter; JAX exposes the same machinery through jax.grad.
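In PyTorch that looks like this (toy scalar example):

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x = torch.tensor(3.0)
loss = (w * x - 1.0) ** 2   # forward pass builds the graph
loss.backward()             # backward pass applies the chain rule
print(w.grad)               # dL/dw = 2(wx − 1) · x = 30
```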
Practical optimization advice
- Start with AdamW. Don’t tune the optimizer first.
- Learning rate is the most important hyperparameter. Sweep it before anything else.
- Use warmup for anything attention-based. No exceptions.
- Clip gradients at norm 1.0 if training is unstable.
- Watch loss and gradient norms; if the gradient norm explodes, something is wrong. (Both are sketched after this list.)
- Mixed precision (fp16/bf16) is essentially free speedup — turn it on.
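A sketch of the clipping-and-watching advice in PyTorch (the model, data, and loop are dummies for illustration):

```python
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x, y = torch.randn(32, 10), torch.randn(32, 1)

for step in range(10):
    loss = ((model(x) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    # Rescales gradients so the global norm is ≤ 1.0 and returns the
    # pre-clip norm, which is the number worth logging every step.
    grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    opt.step()
    print(step, round(loss.item(), 4), round(grad_norm.item(), 4))
```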
Exercises
- Derive backprop. For `y = w₂ · σ(w₁ · x + b₁) + b₂` where `σ` is sigmoid, derive `∂L/∂w₁` for `L = (y − target)²`.
- Plot GD trajectories. Minimize `f(x, y) = x² + 10y²` (an elongated bowl) with vanilla SGD and with momentum. Compare paths.
- Tune the learning rate. Train an MLP on MNIST with `η = 1e-1, 1e-2, 1e-3, 1e-4, 1e-5`. Plot loss curves. Where does training fail? Where is it slow?
- Implement Adam. From scratch in NumPy on a toy quadratic. Verify it converges faster than SGD.
See also
- Probability & statistics — losses are expectations
- Information theory — what we minimize
- Stage 03 — Backpropagation
- Stage 03 — Optimizers