Stage 03 — Neural Networks
Stack a few layers of Wx + b interleaved with non-linearities, train with gradient descent. That’s a neural network. Everything modern — transformers, diffusion, LLMs — is built on this.
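For concreteness, here is a minimal sketch of that stack in PyTorch. The shapes (784 inputs, 256 hidden units, 10 outputs) are illustrative assumptions, not part of the definition:

```python
import torch

x = torch.randn(32, 784)           # a batch of 32 inputs with 784 features each

W1 = torch.randn(784, 256) * 0.01  # first layer: 784 -> 256
b1 = torch.zeros(256)
W2 = torch.randn(256, 10) * 0.01   # second layer: 256 -> 10
b2 = torch.zeros(10)

h = torch.relu(x @ W1 + b1)        # Wx + b, then a non-linearity
logits = h @ W2 + b2               # final Wx + b produces the outputs
```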
Prerequisites
- Stage 01 (calculus, linear algebra)
- Stage 02 (loss functions, train/val/test)
Learning ladder
- Perceptrons & MLPs — the universal approximator
- Backpropagation — chain rule + autograd
- Activations & initialization — what makes networks trainable (see the initialization sketch after this list)
- Optimizers — SGD → Adam → AdamW → modern variants (see the update-rule sketch after this list)
- Regularization techniques — dropout, normalization, weight decay
- Architectures: CNN & RNN — convolutional and recurrent nets
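A hedged illustration of the initialization rung: with ReLU, scaling the weight variance to fan-in (He/Kaiming initialization) keeps activation magnitudes roughly stable, while an arbitrary scale does not. The layer sizes and the 0.5 scale are made-up examples:

```python
import torch

fan_in, fan_out = 1024, 1024
x = torch.randn(64, fan_in)

W_naive = torch.randn(fan_in, fan_out) * 0.5                 # arbitrary scale
W_he = torch.randn(fan_in, fan_out) * (2.0 / fan_in) ** 0.5  # He init for ReLU

print(torch.relu(x @ W_naive).std())  # roughly 9-10: magnitudes explode with depth
print(torch.relu(x @ W_he).std())     # roughly 0.8: stays well-scaled layer after layer
```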
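And a hedged sketch of the optimizer rung, writing the SGD and AdamW update rules by hand for a single step. The hyperparameters shown are common defaults, used here only as assumptions:

```python
import torch

p = torch.randn(1000)   # a parameter vector
g = torch.randn(1000)   # its gradient for this step

# SGD: step straight down the gradient.
lr = 0.1
p_sgd = p - lr * g

# AdamW: adapt the step per parameter using running moments of the gradient,
# and apply weight decay directly to the weights (decoupled from the gradient).
beta1, beta2, eps, wd, lr = 0.9, 0.999, 1e-8, 0.01, 1e-3
m = torch.zeros_like(p)               # running mean of gradients
v = torch.zeros_like(p)               # running mean of squared gradients

m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g ** 2
m_hat = m / (1 - beta1)               # bias correction at step t = 1
v_hat = v / (1 - beta2)
p_adamw = p - lr * (m_hat / (v_hat.sqrt() + eps) + wd * p)
```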
MVU
You can:
- Hand-derive backprop through a 2-layer MLP (a worked sketch follows this list)
- Explain why we use ReLU instead of sigmoid in deep nets
- Pick a sensible learning rate for a new model
- Diagnose a training run from loss curves alone
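A hedged sketch of the first item: backprop through a tiny 2-layer MLP, derived by hand with the chain rule and checked against autograd. The sizes and the squared-error loss are arbitrary choices for illustration:

```python
import torch

x = torch.randn(8, 4)
y = torch.randn(8, 3)
W1 = torch.randn(4, 5, requires_grad=True)
W2 = torch.randn(5, 3, requires_grad=True)

# Forward pass
z1 = x @ W1                              # pre-activation, shape (8, 5)
h = torch.relu(z1)
yhat = h @ W2                            # predictions, shape (8, 3)
loss = ((yhat - y) ** 2).mean()

# Backward pass by hand, applying the chain rule layer by layer
dyhat = 2 * (yhat - y) / yhat.numel()    # dL/dyhat
dW2 = h.T @ dyhat                        # dL/dW2
dh = dyhat @ W2.T                        # dL/dh
dz1 = dh * (z1 > 0).float()              # ReLU passes gradient only where z1 > 0
dW1 = x.T @ dz1                          # dL/dW1

# Autograd should agree with the hand derivation
loss.backward()
print(torch.allclose(dW1, W1.grad), torch.allclose(dW2, W2.grad))
```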
Exercise
Train an MLP on MNIST in raw PyTorch (no nn.Sequential, no nn.Linear — write everything from torch.tensor and torch.matmul). Hit >97% test accuracy. Then add BatchNorm and dropout; observe the difference.
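A hedged starting skeleton for the exercise, with manual parameters, a hand-written forward pass and cross-entropy, and a manual SGD step. The layer sizes, learning rate, and the torchvision loader are assumptions; the test-accuracy evaluation, BatchNorm, and dropout are left to you:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train, batch_size=128, shuffle=True)

def init(fan_in, fan_out):
    w = torch.randn(fan_in, fan_out) * (2.0 / fan_in) ** 0.5  # He init
    return w.requires_grad_()                                  # leaf tensor tracked by autograd

W1, b1 = init(784, 256), torch.zeros(256, requires_grad=True)
W2, b2 = init(256, 10), torch.zeros(10, requires_grad=True)
params = [W1, b1, W2, b2]
lr = 0.1

for epoch in range(3):
    for images, labels in loader:
        x = images.view(-1, 784)                   # flatten 28x28 images
        h = torch.relu(x @ W1 + b1)
        logits = h @ W2 + b2

        # Cross-entropy by hand: log-softmax, then pick out the true class
        logp = logits - logits.logsumexp(dim=1, keepdim=True)
        loss = -logp[torch.arange(len(labels)), labels].mean()

        loss.backward()
        with torch.no_grad():                      # manual SGD update
            for p in params:
                p -= lr * p.grad
                p.grad = None
```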
What you’ll build by the end
A clear mental model for every component of a transformer block (Stage 06): linear layer, activation, normalization, residual connection. The transformer is just an MLP with attention layers spliced in.
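To make that concrete, here is a hedged sketch of those components assembled the way a transformer block uses them, with attention omitted since that is Stage 06. The sizes are illustrative:

```python
import torch
import torch.nn as nn

d_model, d_ff = 512, 2048            # illustrative sizes

norm = nn.LayerNorm(d_model)         # normalization
lin1 = nn.Linear(d_model, d_ff)      # linear layer (expand)
lin2 = nn.Linear(d_ff, d_model)      # linear layer (project back)
act = nn.GELU()                      # activation

x = torch.randn(4, 16, d_model)      # (batch, sequence, features)
y = x + lin2(act(lin1(norm(x))))     # residual connection around the MLP
```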
See also
- Stage 04 — Language modeling — RNN-based language models
- Stage 06 — Transformers