Architectures: CNNs and RNNs
Before transformers, two architectures dominated. Even today, they’re not gone — CNNs still rule low-level vision; RNNs survive in speech, control, and as building blocks for hybrid models.
Convolutional Neural Networks (CNNs)
A convolutional layer applies a small filter (kernel) across spatial positions of the input, sharing weights:
output[i,j] = Σ_{u,v} kernel[u,v] · input[i+u, j+v] + b
Properties:
- Translation equivariance: a filter detecting an edge works at every position.
- Local connectivity: each output position only depends on a small neighborhood.
- Parameter sharing: one kernel, used everywhere — far fewer parameters than an MLP.
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
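A minimal sketch of that layer in use (the batch and image sizes are illustrative):
import torch
import torch.nn as nn
x = torch.randn(8, 3, 32, 32)  # hypothetical batch: 8 RGB images, 32x32
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv(x).shape)  # (8, 64, 32, 32): padding=1 with a 3x3 kernel preserves spatial size
print(sum(p.numel() for p in conv.parameters()))  # 1792 = 64*3*3*3 weights + 64 biases, independent of image size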
Stride, padding, dilation
- Stride: how far the kernel moves between applications. Stride > 1 downsamples.
- Padding: zeros added around the input border. Lets the kernel cover boundary pixels and controls the output size.
- Dilation: skip pixels in the kernel — larger receptive field without more parameters.
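How those options change the output size, as a sketch (input shape is illustrative; the standard formula is out = floor((in + 2*padding - dilation*(kernel-1) - 1) / stride) + 1):
import torch
import torch.nn as nn
x = torch.randn(1, 1, 28, 28)  # hypothetical single-channel 28x28 input
for kwargs in (
    dict(kernel_size=3, stride=1, padding=0),              # no padding: shrinks to 26x26
    dict(kernel_size=3, stride=1, padding=1),              # "same" padding: stays 28x28
    dict(kernel_size=3, stride=2, padding=1),              # stride 2: downsamples to 14x14
    dict(kernel_size=3, stride=1, padding=2, dilation=2),  # dilation 2: larger receptive field, stays 28x28
):
    conv = nn.Conv2d(1, 1, **kwargs)
    print(kwargs, tuple(conv(x).shape[-2:]))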
Pooling
Aggregate local activations to downsample:
- Max pool: the max in each window.
- Average pool: the mean.
- Adaptive pooling: collapse to a fixed output size.
Modern CNNs sometimes skip pooling, using strided convolutions instead.
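The three pooling variants and the strided-conv alternative, as a sketch (feature-map size is illustrative):
import torch
import torch.nn as nn
x = torch.randn(1, 64, 32, 32)  # hypothetical feature map
print(nn.MaxPool2d(kernel_size=2)(x).shape)          # (1, 64, 16, 16): max in each 2x2 window
print(nn.AvgPool2d(kernel_size=2)(x).shape)          # (1, 64, 16, 16): mean in each 2x2 window
print(nn.AdaptiveAvgPool2d(output_size=1)(x).shape)  # (1, 64, 1, 1): fixed output size regardless of input
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # pooling-free downsampling
print(strided(x).shape)                              # (1, 64, 16, 16)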
Classic architectures
- LeNet (1998): handwritten digits. The blueprint.
- AlexNet (2012): ImageNet breakthrough.
- VGG (2014): deep stack of 3×3 convs.
- ResNet (2015): residual connections enable 100+ layers.
- EfficientNet (2019): principled width/depth/resolution scaling.
- ConvNeXt (2022): modern conv design competitive with transformers.
Residual connections
output = F(x) + x
Add the input to the output of a sub-block. Gradients flow through the addition unchanged → mitigates vanishing gradients in deep networks. Now ubiquitous, including in transformers.
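A minimal residual block, assuming matching channel counts (real ResNet blocks also insert batch norm):
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is a small conv stack."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.act = nn.ReLU()
    def forward(self, x):
        return self.act(self.body(x) + x)  # the addition is the skip connection
block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # (1, 64, 32, 32): same shape in and out, so blocks stack arbitrarily deep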
When to use CNNs in 2026
- Local pattern problems: medical imaging, microscopy, low-level vision.
- Compute-constrained inference: CNNs are still cheaper than ViTs at small resolutions.
- As a feature extractor: a ResNet backbone before a transformer head is still a competitive recipe.
For “general” image understanding, Vision Transformers (ViTs) and hybrid models have largely replaced pure CNNs at the frontier.
Recurrent Neural Networks (RNNs)
A vanilla RNN processes a sequence one step at a time, carrying a hidden state:
h_t = tanh(W_h h_{t-1} + W_x x_t + b)
y_t = W_y h_t
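Those two equations as code, unrolled over a toy sequence (all sizes are made up):
import torch
d_in, d_hidden, d_out = 10, 20, 4
W_x = torch.randn(d_hidden, d_in) * 0.1
W_h = torch.randn(d_hidden, d_hidden) * 0.1
b = torch.zeros(d_hidden)
W_y = torch.randn(d_out, d_hidden) * 0.1
xs = torch.randn(5, d_in)  # a length-5 input sequence
h = torch.zeros(d_hidden)  # initial hidden state
for x_t in xs:             # strictly sequential: step t needs h from step t-1
    h = torch.tanh(W_h @ h + W_x @ x_t + b)  # h_t = tanh(W_h h_{t-1} + W_x x_t + b)
    y_t = W_y @ h                            # y_t = W_y h_t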
Properties:
- Variable-length input.
- Sequential — can’t parallelize over time.
- Memory: hidden state carries information forward.
- Vanishing/exploding gradients through time make long-range dependencies hard.
LSTMs
Long Short-Term Memory networks add gating:
f = σ(W_f [x, h]) # forget gate
i = σ(W_i [x, h]) # input gate
g = tanh(W_g [x, h]) # candidate cell state
c = f ⊙ c_prev + i ⊙ g
o = σ(W_o [x, h]) # output gate
h = o ⊙ tanh(c)
The cell state c flows through with multiplicative gating, mitigating vanishing gradients. LSTMs were the dominant sequence architecture from ~2014 to 2017.
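The same update written out with tensor ops, one step, biases omitted as in the equations above (sizes are illustrative):
import torch
d_in, d_hidden = 8, 16
W_f, W_i, W_g, W_o = (torch.randn(d_hidden, d_in + d_hidden) * 0.1 for _ in range(4))
x = torch.randn(d_in)           # current input x_t
h = torch.zeros(d_hidden)       # previous hidden state
c_prev = torch.zeros(d_hidden)  # previous cell state
xh = torch.cat([x, h])          # the [x, h] from the equations
f = torch.sigmoid(W_f @ xh)     # forget gate: how much of c_prev to keep
i = torch.sigmoid(W_i @ xh)     # input gate: how much new content to write
g = torch.tanh(W_g @ xh)        # candidate cell state
c = f * c_prev + i * g          # cell state update (elementwise products)
o = torch.sigmoid(W_o @ xh)     # output gate
h = o * torch.tanh(c)           # new hidden state
In practice you'd use nn.LSTM or nn.LSTMCell, which add biases and run over whole (batched) sequences.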
GRUs
Gated Recurrent Units: a simpler variant that folds the forget and input gates into a single update gate and drops the separate cell state. Often competitive with LSTMs despite fewer parameters.
Bidirectional RNNs
Run two RNNs, one forward and one backward; concatenate states. Useful when you have the full sequence at training time (e.g. sequence labeling). Not applicable for autoregressive generation.
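In PyTorch this is a constructor flag; the concatenation shows up as a doubled feature dimension (sizes are illustrative):
import torch
import torch.nn as nn
rnn = nn.LSTM(input_size=10, hidden_size=20, bidirectional=True, batch_first=True)
x = torch.randn(4, 50, 10)  # batch of 4 sequences, length 50
out, _ = rnn(x)
print(out.shape)            # (4, 50, 40): forward and backward hidden states concatenated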
Sequence-to-sequence (seq2seq)
Encoder RNN reads input, produces final hidden state; decoder RNN starts from that state and generates output. The original neural machine translation architecture (2014).
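A skeletal GRU-based encoder-decoder in that spirit, without attention (names and sizes are mine, for illustration only):
import torch
import torch.nn as nn
class Seq2Seq(nn.Module):
    """Encoder reads the source; its final hidden state initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)
    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))           # h: the encoder's final hidden state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # decoder starts from that state
        return self.out(dec_out)                             # logits over the target vocabulary
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 9)))
print(logits.shape)  # (2, 9, 1000)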
Bahdanau attention (2015) extended seq2seq with an attention mechanism — letting the decoder selectively look at encoder states. This was the seed of the transformer.
Why transformers replaced RNNs
- Parallelism. Transformers process the whole sequence at once. RNNs are step-by-step.
- Long-range dependencies. Direct attention reaches anywhere in the sequence; RNNs have to pass info through every intermediate step.
- Scaling. Transformers scale beautifully with compute and data; RNNs run into optimization and throughput walls.
But RNNs aren’t dead in 2026:
- Mamba and state-space models (SSMs) revive recurrent ideas with parallelizable training.
- xLSTM and modernized RNN variants compete on long context.
- RWKV combines RNN-like inference with transformer-like training.
For most language tasks, transformers still win — but the recurrent thread is alive.
CNNs and RNNs as building blocks today
In modern systems they appear:
- Audio: convolutional front-ends in Whisper-style ASR.
- Vision: convolutional patch embedders feed Vision Transformers.
- Diffusion: U-Nets are convolutional; DiT (Diffusion Transformer) is replacing them at scale.
- Tokenization for images/video: VAEs with conv encoders compress to latent grids.
You don't need to be deep on CNNs/RNNs to build LLM apps, but you'll meet them (and sometimes beat them) across multimodal stacks.
Exercises
- CNN on MNIST: build a 2-conv-layer CNN. Hit >99% test accuracy.
- Receptive field: compute the receptive field of a 3-layer CNN with kernel size 3 and stride 1. Now with stride 2.
- Char-level RNN: train a 1-layer LSTM to generate text from a small corpus. Notice it forgets context past ~50 chars.
- Compare to a tiny transformer: train a 2-layer transformer on the same task. Compare quality.
See also
- Stage 04 — Why transformers — why we mostly moved past RNNs
- Stage 06 — Transformers
- Stage 12 — Vision-language models