Stage 03 — Neural Networks

Stack a few layers of Wx + b interleaved with non-linearities, train with gradient descent. That’s a neural network. Everything modern — transformers, diffusion, LLMs — is built on this.
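A minimal PyTorch sketch of that stack (layer sizes and variable names are illustrative, not prescribed):

```python
import torch

# Two affine layers (Wx + b) with a ReLU non-linearity in between.
x = torch.randn(32, 784)                              # batch of 32 flattened inputs
W1, b1 = torch.randn(784, 256) * 0.01, torch.zeros(256)
W2, b2 = torch.randn(256, 10) * 0.01, torch.zeros(10)

h = torch.relu(x @ W1 + b1)   # the non-linearity is what makes this more than one big matrix
logits = h @ W2 + b2          # 10 class scores per example
```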

Prerequisites

  • Stage 01 (calculus, linear algebra)
  • Stage 02 (loss functions, train/val/test)

Learning ladder

  1. Perceptrons & MLPs — the universal approximator
  2. Backpropagation — chain rule + autograd
  3. Activations & initialization — what makes networks trainable (see the init sketch after this list)
  4. Optimizers — SGD → Adam → AdamW → modern variants
  5. Regularization techniques — dropout, normalization, weight decay
  6. Architectures: CNN & RNN — convolutional and recurrent nets
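For item 3, a small sketch of why fan-in-scaled initialization matters in deep ReLU stacks (depth and width here are arbitrary): with naive unit-variance weights the activation scale blows up layer by layer, with Kaiming/He scaling it stays roughly constant.

```python
import torch

def forward_stats(scale_by_fan_in: bool, depth: int = 20, width: int = 512):
    """Push a random batch through a deep ReLU stack; return activation std per layer."""
    x = torch.randn(1024, width)
    stds = []
    for _ in range(depth):
        W = torch.randn(width, width)
        if scale_by_fan_in:
            W *= (2.0 / width) ** 0.5   # Kaiming/He scaling for ReLU
        x = torch.relu(x @ W)
        stds.append(x.std().item())
    return stds

print("naive init, last-layer std:  ", forward_stats(False)[-1])  # blows up with depth
print("Kaiming init, last-layer std:", forward_stats(True)[-1])   # stays near 1
```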

MVU

You can:

  • Hand-derive backprop through a 2-layer MLP (a worked sketch follows this list)
  • Explain why we use ReLU instead of sigmoid in deep nets
  • Pick a sensible learning rate for a new model
  • Diagnose a training run from loss curves alone
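A sketch of the first bullet: derive the gradients of a 2-layer MLP by hand via the chain rule, then check them against autograd. The shapes and the squared-error loss are illustrative choices.

```python
import torch

x = torch.randn(16, 4)
y = torch.randn(16, 3)
W1 = torch.randn(4, 8, requires_grad=True)
W2 = torch.randn(8, 3, requires_grad=True)

# Forward pass
z = x @ W1               # pre-activation
h = torch.relu(z)        # hidden activation
yhat = h @ W2            # output
loss = ((yhat - y) ** 2).mean()

# Hand-derived gradients (chain rule, outermost layer first)
dyhat = 2 * (yhat - y) / y.numel()   # dL/dyhat
dW2 = h.T @ dyhat                    # dL/dW2
dh = dyhat @ W2.T                    # dL/dh
dz = dh * (z > 0).float()            # ReLU gates the gradient
dW1 = x.T @ dz                       # dL/dW1

# Autograd agrees
loss.backward()
print(torch.allclose(dW2, W2.grad), torch.allclose(dW1, W1.grad))
```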

Exercise

Train an MLP on MNIST in raw PyTorch (no nn.Sequential, no nn.Linear — write everything from torch.tensor and torch.matmul). Hit >97% test accuracy. Then add BatchNorm and dropout; observe the difference.
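A hedged starting point, not a full solution (sizes, learning rate, and helper names are placeholders; data loading is omitted): manual parameters, a matmul-only forward pass, a hand-written cross-entropy, and a manual SGD step.

```python
import torch

# Manual parameters for a 784 -> 256 -> 10 MLP (a reasonable default, not prescribed).
def param(*shape):
    return (torch.randn(*shape) * (2.0 / shape[0]) ** 0.5).requires_grad_()

W1, b1 = param(784, 256), torch.zeros(256, requires_grad=True)
W2, b2 = param(256, 10), torch.zeros(10, requires_grad=True)
params = [W1, b1, W2, b2]

def forward(x):
    h = torch.relu(torch.matmul(x, W1) + b1)
    return torch.matmul(h, W2) + b2          # logits

def cross_entropy(logits, y):
    # log-softmax + negative log-likelihood, written out by hand
    logp = logits - logits.logsumexp(dim=1, keepdim=True)
    return -logp[torch.arange(len(y)), y].mean()

def step(x, y, lr=0.1):
    loss = cross_entropy(forward(x), y)
    loss.backward()
    with torch.no_grad():
        for p in params:
            p -= lr * p.grad                 # plain SGD update
            p.grad = None
    return loss.item()
```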

What you’ll build by the end

A clear mental model for every component of a transformer block (Stage 06): linear layer, activation, normalization, residual connection. The transformer is just an MLP with attention layers spliced in.
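To preview that mapping, a schematic pre-norm block in PyTorch-style pseudocode; the pre-norm arrangement is one common choice, and `attention` stands in for the mechanism covered in Stage 06.

```python
def transformer_block(x, attention, mlp, norm1, norm2):
    """One pre-norm transformer block, built from the pieces listed above.

    x         : (batch, seq, dim) activations
    attention : Stage 06's self-attention layer (a black box here)
    mlp       : linear -> activation -> linear, exactly the Stage 03 MLP
    norm1/2   : normalization layers (e.g. LayerNorm)
    """
    x = x + attention(norm1(x))   # residual connection around attention
    x = x + mlp(norm2(x))         # residual connection around the MLP
    return x
```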

See also