Perceptrons & MLPs

The perceptron (1958)

Frank Rosenblatt’s neuron model:

y = step(w · x + b)

A linear classifier: it fires when the weighted input exceeds a threshold, and it is trained with the perceptron learning rule. Famously, it can't learn XOR, which is not linearly separable (Minsky & Papert, 1969), and that result froze the field for years.
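A minimal sketch of one perceptron update in NumPy (the function name, 0/1 labels, and learning rate here are illustrative choices, not part of the original description):

import numpy as np

def perceptron_step(w, b, x, y, lr=1.0):
    # Predict with the step activation: fire if w·x + b > 0
    y_hat = 1.0 if np.dot(w, x) + b > 0 else 0.0
    # Update only when the prediction is wrong (error is -1, 0, or +1)
    error = y - y_hat
    w = w + lr * error * x
    b = b + lr * error
    return w, b

Iterating this update over the training set converges whenever the data is linearly separable.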

The multi-layer perceptron (MLP)

Stack neurons with non-linear activations between them:

h₁ = σ(W₁ x + b₁)
h₂ = σ(W₂ h₁ + b₂)
y  = W₃ h₂ + b₃

Each layer is a linear transformation followed by a non-linearity. The non-linearity is essential — without it, stacked layers collapse to a single linear map.
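A quick numerical check of that collapse, as a sketch (the layer sizes are arbitrary):

import torch
import torch.nn as nn

lin1, lin2 = nn.Linear(4, 8), nn.Linear(8, 3)
x = torch.randn(5, 4)

# Two stacked linear layers with no activation in between...
y_stacked = lin2(lin1(x))

# ...equal a single linear map with composed weight and bias
W = lin2.weight @ lin1.weight                # (3, 4)
b = lin2.weight @ lin1.bias + lin2.bias      # (3,)
y_single = x @ W.T + b

print(torch.allclose(y_stacked, y_single, atol=1e-6))   # True

Insert a ReLU between the two layers and the equality breaks.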

Universal approximation

A single hidden layer with enough neurons can approximate any continuous function on a compact domain (Cybenko 1989, Hornik 1991).

So why go deeper? Because for many functions, depth is exponentially more efficient than width. A deep network with N total parameters can represent functions that a shallow network would need vastly more parameters to match.

A minimal MLP in PyTorch

import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        # Two hidden layers plus an output projection
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, out_dim)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)   # raw logits; pair with a loss like cross-entropy

model = MLP(784, 256, 10)   # MNIST

That’s a 2-hidden-layer MLP. Add more Linear → activation pairs for depth.
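One way to build arbitrary depth, sketched with nn.Sequential (the helper make_mlp and the layer sizes below are illustrative, not part of the example above):

def make_mlp(dims):
    # dims = [in_dim, hidden_1, ..., hidden_k, out_dim]
    layers = []
    for i in range(len(dims) - 1):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < len(dims) - 2:          # no activation after the output layer
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

deeper = make_mlp([784, 512, 256, 128, 10])   # a 3-hidden-layer variant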

Counting parameters

For a layer Linear(in_dim, out_dim):

  • Weights: in_dim × out_dim
  • Biases: out_dim

For our MNIST MLP:

  • fc1: 784 × 256 + 256 = 200,960
  • fc2: 256 × 256 + 256 = 65,792
  • fc3: 256 × 10 + 10 = 2,570
  • Total: ~270k
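These counts are easy to verify against the model defined above using PyTorch's parameters() iterator:

for name, p in model.named_parameters():
    print(name, tuple(p.shape), p.numel())
print(sum(p.numel() for p in model.parameters()))   # 269322  (~270k)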

Modern LLMs have billions of parameters, spread across dozens to hundreds of layers with very wide hidden dimensions.

Inductive biases of MLPs

An MLP is the most general neural architecture, making almost no assumptions about the structure of its input. That's both a strength and a weakness:

  • Strength: can fit anything.
  • Weakness: can’t exploit structure. For images, an MLP ignores translation invariance; for sequences, it ignores order/locality.

That’s why we have CNNs (for images) and transformers (for sequences) — they bake in priors that make the search space tractable.

But MLPs aren’t going away: every transformer block has an MLP inside it (the feed-forward network).
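That feed-forward network is the same Linear → activation → Linear pattern. A minimal sketch (the 4× expansion factor and GELU are common choices here, not requirements):

class FeedForward(nn.Module):
    def __init__(self, d_model, expansion=4):
        super().__init__()
        self.fc1 = nn.Linear(d_model, expansion * d_model)
        self.fc2 = nn.Linear(expansion * d_model, d_model)

    def forward(self, x):                  # x: (B, T, d_model)
        return self.fc2(F.gelu(self.fc1(x)))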

Training loop

Conceptually:

for epoch in range(n_epochs):
    for x_batch, y_batch in dataloader:
        # Forward
        logits = model(x_batch)
        loss = F.cross_entropy(logits, y_batch)

        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

Five lines that capture all of supervised deep learning. Everything else is detail.
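To actually run it, you also need a dataset and an optimizer. A minimal setup on random data standing in for MNIST (the optimizer, learning rate, and batch size below are placeholder choices):

from torch.utils.data import DataLoader, TensorDataset

# Fake MNIST-shaped data: 1,000 flattened 28x28 images, 10 classes
x_fake = torch.randn(1000, 784)
y_fake = torch.randint(0, 10, (1000,))
dataloader = DataLoader(TensorDataset(x_fake, y_fake), batch_size=64, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
n_epochs = 3

With those definitions in place, the loop above runs as written.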

Common shapes

For a batch:

  • Input: (B, in_dim)
  • After fc1: (B, hidden_dim)
  • After fc2: (B, hidden_dim)
  • Output: (B, out_dim)

For a sequence model:

  • Input: (B, T, in_dim) where T is sequence length

When debugging, always print shapes.
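A quick sketch of what that looks like with the MLP defined earlier (the batch size is arbitrary):

x = torch.randn(32, 784)       # (B, in_dim)
h = F.relu(model.fc1(x))
print(h.shape)                 # torch.Size([32, 256])
h = F.relu(model.fc2(h))
print(h.shape)                 # torch.Size([32, 256])
out = model.fc3(h)
print(out.shape)               # torch.Size([32, 10])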

Why MLPs aren’t enough for modern AI

Two reasons:

  1. They don’t share parameters across structure. A CNN reuses the same kernel everywhere in an image. A transformer reuses the same attention weights at every position. An MLP has no such sharing — every input dimension gets its own weight.
  2. They scale poorly to long inputs. A 1024-pixel input feeding a 1024-unit hidden layer already costs ~1M parameters in that single layer; for inputs of thousands of tokens, fully connected weights become prohibitive.

Convolutions solved (1) for images. Attention solved both for sequences.
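A rough illustration of the parameter-sharing point, with illustrative sizes for a 32×32 single-channel image:

dense = nn.Linear(32 * 32, 32 * 32)                  # one weight per (input, output) pair
conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)     # one 3x3 kernel reused at every position

print(sum(p.numel() for p in dense.parameters()))    # 1049600
print(sum(p.numel() for p in conv.parameters()))     # 10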

Exercises

  1. From scratch in NumPy. Implement a 2-layer MLP forward pass without PyTorch. Verify against a PyTorch version on the same weights.
  2. Param count. Compute parameter count for an MLP with shapes [784, 512, 256, 10].
  3. MLP on tabular data. Train an MLP on the UCI Adult dataset. Compare to logistic regression and gradient-boosted trees. (Spoiler: trees probably win.)
  4. MLP without non-linearity. Replace F.relu with identity. Train. Observe that it’s a linear classifier in disguise.

See also