Stage 03 — Neural Networks
Stack a few layers of Wx + b interleaved with non-linearities, train with gradient descent. That’s a neural network. Everything modern — transformers, diffusion, LLMs — is built on this.
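For concreteness, here is a minimal sketch of that stack in PyTorch. The shapes (784 inputs, 256 hidden units, 10 outputs) are illustrative assumptions, not part of the definition:

```python
import torch

x = torch.randn(32, 784)           # a batch of 32 inputs with 784 features each

W1 = torch.randn(784, 256) * 0.01  # first layer: 784 -> 256
b1 = torch.zeros(256)
W2 = torch.randn(256, 10) * 0.01   # second layer: 256 -> 10
b2 = torch.zeros(10)

h = torch.relu(x @ W1 + b1)        # Wx + b, then a non-linearity
logits = h @ W2 + b2               # final Wx + b produces the outputs
```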
Prerequisites
- Stage 01 (calculus, linear algebra)
- Stage 02 (loss functions, train/val/test)
Learning ladder
- Perceptrons & MLPs — the universal approximator
- Backpropagation — chain rule + autograd
- Activations & initialization — what makes networks trainable (see the initialization sketch after this list)
- Optimizers — SGD → Adam → AdamW → modern variants (see the update-rule sketch after this list)
- Regularization techniques — dropout, normalization, weight decay
- Architectures: CNN & RNN — convolutional and recurrent nets
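A hedged illustration of the initialization rung: with ReLU, scaling the weight variance to fan-in (He/Kaiming initialization) keeps activation magnitudes roughly stable, while an arbitrary scale does not. The layer sizes and the 0.5 scale are made-up examples:

```python
import torch

fan_in, fan_out = 1024, 1024
x = torch.randn(64, fan_in)

W_naive = torch.randn(fan_in, fan_out) * 0.5                 # arbitrary scale
W_he = torch.randn(fan_in, fan_out) * (2.0 / fan_in) ** 0.5  # He init for ReLU

print(torch.relu(x @ W_naive).std())  # roughly 9-10: magnitudes explode with depth
print(torch.relu(x @ W_he).std())     # roughly 0.8: stays well-scaled layer after layer
```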
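And a hedged sketch of the optimizer rung, writing the SGD and AdamW update rules by hand for a single step. The hyperparameters shown are common defaults, used here only as assumptions:

```python
import torch

p = torch.randn(1000)   # a parameter vector
g = torch.randn(1000)   # its gradient for this step

# SGD: step straight down the gradient.
lr = 0.1
p_sgd = p - lr * g

# AdamW: adapt the step per parameter using running moments of the gradient,
# and apply weight decay directly to the weights (decoupled from the gradient).
beta1, beta2, eps, wd, lr = 0.9, 0.999, 1e-8, 0.01, 1e-3
m = torch.zeros_like(p)               # running mean of gradients
v = torch.zeros_like(p)               # running mean of squared gradients

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1)               # bias correction at step t = 1
v_hat = v / (1 - beta2)
p_adamw = p - lr * (m_hat / (v_hat.sqrt() + eps) + wd * p)
```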
MVU
You can:
- Hand-derive backprop through a 2-layer MLP (a worked sketch follows this list)
- Explain why we use ReLU instead of sigmoid in deep nets
- Pick a sensible learning rate for a new model
- Diagnose a training run from loss curves alone
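A hedged sketch of the first item: backprop through a tiny 2-layer MLP, derived by hand with the chain rule and checked against autograd. The sizes and the squared-error loss are arbitrary choices for illustration:

```python
import torch

x = torch.randn(8, 4)
y = torch.randn(8, 3)
W1 = torch.randn(4, 5, requires_grad=True)
W2 = torch.randn(5, 3, requires_grad=True)

# Forward pass
z1 = x @ W1                              # pre-activation, shape (8, 5)
h = torch.relu(z1)
yhat = h @ W2                            # predictions, shape (8, 3)
loss = ((yhat - y) ** 2).mean()

# Backward pass by hand, applying the chain rule layer by layer
dyhat = 2 * (yhat - y) / yhat.numel()    # dL/dyhat
dW2 = h.T @ dyhat                        # dL/dW2
dh = dyhat @ W2.T                        # dL/dh
dz1 = dh * (z1 > 0).float()              # ReLU passes gradient only where z1 > 0
dW1 = x.T @ dz1                          # dL/dW1

# Autograd should agree with the hand derivation
loss.backward()
print(torch.allclose(dW1, W1.grad), torch.allclose(dW2, W2.grad))
```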
Exercise
Train an MLP on MNIST in raw PyTorch (no nn.Sequential, no nn.Linear — write everything from torch.tensor and torch.matmul). Hit >97% test accuracy. Then add BatchNorm and dropout; observe the difference.
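A hedged starting skeleton for the exercise, with manual parameters, a hand-written forward pass and cross-entropy, and a manual SGD step. The layer sizes, learning rate, and the torchvision loader are assumptions; the test-accuracy evaluation, BatchNorm, and dropout are left to you:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=128, shuffle=True)

def init(fan_in, fan_out):
    w = torch.randn(fan_in, fan_out) * (2.0 / fan_in) ** 0.5  # He init
    return w.requires_grad_()                                  # leaf tensor tracked by autograd

W1, b1 = init(784, 256), torch.zeros(256, requires_grad=True)
W2, b2 = init(256, 10), torch.zeros(10, requires_grad=True)
params = [W1, b1, W2, b2]
lr = 0.1

for epoch in range(3):
    for images, labels in loader:
        x = images.view(-1, 784)                   # flatten 28x28 images
        h = torch.relu(x @ W1 + b1)
        logits = h @ W2 + b2

        # Cross-entropy by hand: log-softmax, then pick out the true class
        logp = logits - logits.logsumexp(dim=1, keepdim=True)
        loss = -logp[torch.arange(len(labels)), labels].mean()

        loss.backward()
        with torch.no_grad():                      # manual SGD update
            for p in params:
                p -= lr * p.grad
                p.grad = None
```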
What you’ll build by the end
A clear mental model for every component of a transformer block (Stage 06): linear layer, activation, normalization, residual connection. The transformer is just an MLP with attention layers spliced in.
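To make that concrete, here is a hedged sketch of those components assembled the way a transformer block uses them, with attention omitted since that is Stage 06. The sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048            # illustrative sizes

norm = nn.LayerNorm(d_model)         # normalization
lin1 = nn.Linear(d_model, d_ff)      # linear layer (expand)
lin2 = nn.Linear(d_ff, d_model)      # linear layer (project back)
act = nn.GELU()                      # activation

x = torch.randn(4, 16, d_model)      # (batch, sequence, features)
y = x + lin2(act(lin1(norm(x))))     # residual connection around the MLP
```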
See also
- Stage 04 — Language modeling — RNN-based language models
- Stage 06 — Transformers