Perceptrons & MLPs
The perceptron (1958)
Frank Rosenblatt’s neuron model:
y = step(w · x + b)
A linear classifier — fires if the weighted input exceeds a threshold. Trained with the perceptron learning rule: on each mistake, nudge the weights toward the misclassified example (w ← w + η(y − ŷ)x). Famously, it couldn’t learn XOR (Minsky & Papert, 1969), which froze the field for years.
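To make the learning rule concrete, here is a minimal NumPy sketch (names and hyperparameters are illustrative, not from any particular source). It learns AND in a handful of epochs; swap in the XOR targets and the updates never settle:

import numpy as np

def step(z):
    return (z > 0).astype(float)

# AND gate: linearly separable, so the perceptron rule converges
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)  # XOR targets [0, 1, 1, 0] never converge

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(20):
    for xi, yi in zip(X, y):
        pred = step(w @ xi + b)
        # Update only on mistakes: nudge the boundary toward the example
        w += lr * (yi - pred) * xi
        b += lr * (yi - pred)

print(step(X @ w + b))  # [0. 0. 0. 1.]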
The multi-layer perceptron (MLP)
Stack neurons with non-linear activations between them:
h₁ = σ(W₁ x + b₁)
h₂ = σ(W₂ h₁ + b₂)
y = W₃ h₂ + b₃
Each layer is a linear transformation followed by a non-linearity. The non-linearity is essential — without it, stacked layers collapse to a single linear map.
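You can check the collapse numerically; a quick sketch with made-up layer sizes:

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
x = rng.normal(size=4)

# Two stacked linear layers with no activation in between...
stacked = W2 @ (W1 @ x + b1) + b2
# ...equal a single linear layer with merged weights and bias
merged = (W2 @ W1) @ x + (W2 @ b1 + b2)

print(np.allclose(stacked, merged))  # True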
Universal approximation
A single hidden layer with enough neurons can approximate any continuous function on a compact domain (Cybenko 1989, Hornik 1991).
So why go deeper? Because depth is exponentially more efficient than width for representing many functions. A deep network with N total parameters can capture functions a shallow network would need vastly more parameters for.
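As a quick empirical check of the one-hidden-layer claim, the sketch below fits sin(x) on [−π, π] with a single hidden layer (the width 256, the tanh activation, and the step count are arbitrary choices, not dictated by the theorem):

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.linspace(-torch.pi, torch.pi, 1000).unsqueeze(1)
y = torch.sin(x)

# One hidden layer, wide enough for a smooth 1D target
net = nn.Sequential(nn.Linear(1, 256), nn.Tanh(), nn.Linear(256, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = F.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(loss.item())  # should end up small, on the order of 1e-5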
A minimal MLP in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)

model = MLP(784, 256, 10) # MNIST
That’s a 2-hidden-layer MLP. Add more Linear → activation pairs for depth.
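A quick shape sanity check (the batch size 32 is arbitrary):

x = torch.randn(32, 784)  # a fake batch of flattened 28×28 images
print(model(x).shape)     # torch.Size([32, 10])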
Counting parameters
For a layer Linear(in_dim, out_dim):
- Weights: in_dim × out_dim
- Biases: out_dim
For our MNIST MLP:
- fc1: 784 × 256 + 256 = 200,960
- fc2: 256 × 256 + 256 = 65,792
- fc3: 256 × 10 + 10 = 2,570
- Total: ~270k
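You can confirm the arithmetic directly in PyTorch:

total = sum(p.numel() for p in model.parameters())
print(total)  # 269322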
Modern LLMs have billions of parameters across dozens to hundreds of layers and very wide hidden dims.
Inductive biases of MLPs
An MLP is the most general neural architecture, making almost no assumptions about its input. That’s both a strength and a weakness:
- Strength: can fit anything.
- Weakness: can’t exploit structure. For images, an MLP ignores translation invariance; for sequences, it ignores order/locality.
That’s why we have CNNs (for images) and transformers (for sequences) — they bake in priors that make the search space tractable.
But MLPs aren’t going away: every transformer block has an MLP inside it (the feed-forward network).
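For reference, the feed-forward network in a typical transformer block is exactly a 1-hidden-layer MLP applied independently at each position. A common shape, sketched with an illustrative d_model (the 4× expansion is conventional; details vary by model):

import torch.nn as nn

d_model = 512
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expand
    nn.GELU(),                        # non-linearity (GPT-style models often use GELU)
    nn.Linear(4 * d_model, d_model),  # project back
)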
Training loop
Conceptually:
for epoch in range(n_epochs):
    for x_batch, y_batch in dataloader:
        # Forward
        logits = model(x_batch)
        loss = F.cross_entropy(logits, y_batch)
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
Five lines that capture all of supervised deep learning. Everything else is detail.
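To actually run the loop, you need an optimizer and a dataloader. A minimal setup reusing the model above, with synthetic data (the shapes match the MNIST MLP; the data itself is random, purely illustrative):

import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(1024, 784)         # fake flattened "images"
Y = torch.randint(0, 10, (1024,))  # fake class labels
dataloader = DataLoader(TensorDataset(X, Y), batch_size=64, shuffle=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
n_epochs = 3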
Common shapes
For a batch:
- Input: (B, in_dim)
- After fc1: (B, hidden_dim)
- After fc2: (B, hidden_dim)
- Output: (B, out_dim)
For a sequence model:
- Input: (B, T, in_dim), where T is sequence length
When debugging, always print shapes.
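A cheap way to automate that is a forward hook on each layer; a small sketch reusing the model above (the hook function name is mine):

import torch

def print_shape(module, inputs, output):
    # Called after each layer's forward pass
    print(module.__class__.__name__, tuple(output.shape))

for layer in model.children():
    layer.register_forward_hook(print_shape)

model(torch.randn(32, 784))
# Linear (32, 256)
# Linear (32, 256)
# Linear (32, 10)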
Why MLPs aren’t enough for modern AI
Two reasons:
- They don’t share parameters across structure. A CNN reuses the same kernel everywhere in an image. A transformer reuses the same attention weights at every position. An MLP has no such sharing — every input dimension gets its own weight.
- They scale poorly to long inputs. A 1024-pixel input to a 1024-unit hidden layer is already ~1M parameters in that single layer. For sequences of thousands of tokens, dense all-to-all weights become prohibitive.
Convolutions solved the first problem for images. Attention solved both for sequences.
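The parameter-sharing point is easy to see by counting. A dense layer's size depends on the input size; a conv kernel's does not (layer sizes here are illustrative):

import torch.nn as nn

dense = nn.Linear(32 * 32, 32 * 32)               # one weight per (input, output) pair
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)  # one 3×3 kernel, reused at every location

print(sum(p.numel() for p in dense.parameters()))  # 1049600
print(sum(p.numel() for p in conv.parameters()))   # 10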
Exercises
- From scratch in NumPy. Implement a 2-layer MLP forward pass without PyTorch. Verify against a PyTorch version on the same weights.
- Param count. Compute the parameter count for an MLP with shapes [784, 512, 256, 10].
- MLP on tabular data. Train an MLP on the UCI Adult dataset. Compare to logistic regression and gradient-boosted trees. (Spoiler: trees probably win.)
- MLP without non-linearity. Replace F.relu with the identity. Train. Observe that it’s a linear classifier in disguise.
See also
- Backpropagation — how an MLP actually learns
- Activations & initialization — picking σ
- Stage 06 — The transformer block — MLPs inside transformers