Architectures: CNNs and RNNs
Before transformers, two architectures dominated. Even today, they’re not gone — CNNs still rule low-level vision; RNNs survive in speech, control, and as building blocks for hybrid models.
Convolutional Neural Networks (CNNs)
A convolutional layer applies a small filter (kernel) across spatial positions of the input, sharing weights:
output[i,j] = Σ_{u,v} kernel[u,v] · input[i+u, j+v] + b
Properties:
- Translation equivariance: a filter detecting an edge works at every position.
- Local connectivity: each output position only depends on a small neighborhood.
- Parameter sharing: one kernel, used everywhere — far fewer parameters than an MLP.
nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
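A minimal sketch of that layer in use (the batch and image sizes are illustrative):
import torch
import torch.nn as nn
x = torch.randn(8, 3, 32, 32)  # hypothetical batch: 8 RGB images, 32x32
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
print(conv(x).shape)  # (8, 64, 32, 32): padding=1 with a 3x3 kernel preserves spatial size
print(sum(p.numel() for p in conv.parameters()))  # 1792 = 64*3*3*3 weights + 64 biases, independent of image size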
Stride, padding, dilation
- Stride: how far the kernel moves between applications. Stride > 1 downsamples.
- Padding: zeros added around the input border. Lets the kernel cover boundary pixels and controls the output size.
- Dilation: skip pixels in the kernel — larger receptive field without more parameters.
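How those options change the output size, as a sketch (input shape is illustrative; the standard formula is out = floor((in + 2*padding - dilation*(kernel-1) - 1) / stride) + 1):
import torch
import torch.nn as nn
x = torch.randn(1, 1, 28, 28)  # hypothetical single-channel 28x28 input
for kwargs in (
    dict(kernel_size=3, stride=1, padding=0),              # no padding: shrinks to 26x26
    dict(kernel_size=3, stride=1, padding=1),              # "same" padding: stays 28x28
    dict(kernel_size=3, stride=2, padding=1),              # stride 2: downsamples to 14x14
    dict(kernel_size=3, stride=1, padding=2, dilation=2),  # dilation 2: larger receptive field, stays 28x28
):
    conv = nn.Conv2d(1, 1, **kwargs)
    print(kwargs, tuple(conv(x).shape[-2:]))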
Pooling
Aggregate local activations to downsample:
- Max pool: the max in each window.
- Average pool: the mean.
- Adaptive pooling: collapse to a fixed output size.
Modern CNNs sometimes skip pooling, using strided convolutions instead.
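The three pooling variants and the strided-conv alternative, as a sketch (feature-map size is illustrative):
import torch
import torch.nn as nn
x = torch.randn(1, 64, 32, 32)  # hypothetical feature map
print(nn.MaxPool2d(kernel_size=2)(x).shape)          # (1, 64, 16, 16): max in each 2x2 window
print(nn.AvgPool2d(kernel_size=2)(x).shape)          # (1, 64, 16, 16): mean in each 2x2 window
print(nn.AdaptiveAvgPool2d(output_size=1)(x).shape)  # (1, 64, 1, 1): fixed output size regardless of input
strided = nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)  # pooling-free downsampling
print(strided(x).shape)                              # (1, 64, 16, 16)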
Classic architectures
- LeNet (1998): handwritten digits. The blueprint.
- AlexNet (2012): ImageNet breakthrough.
- VGG (2014): deep stack of 3×3 convs.
- ResNet (2015): residual connections enable 100+ layers.
- EfficientNet (2019): principled width/depth/resolution scaling.
- ConvNeXt (2022): modern conv design competitive with transformers.
Residual connections
output = F(x) + x
Add the input to the output of a sub-block. Gradients flow through the addition unchanged → mitigates vanishing gradients in deep networks. Now ubiquitous, including in transformers.
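A minimal residual block, assuming matching channel counts (real ResNet blocks also insert batch norm):
import torch
import torch.nn as nn
class ResidualBlock(nn.Module):
    """y = F(x) + x, where F is a small conv stack."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.act = nn.ReLU()
    def forward(self, x):
        return self.act(self.body(x) + x)  # the addition is the skip connection
block = ResidualBlock(64)
x = torch.randn(1, 64, 32, 32)
print(block(x).shape)  # (1, 64, 32, 32): same shape in and out, so blocks stack arbitrarily deep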
When to use CNNs in 2026
- Local pattern problems: medical imaging, microscopy, low-level vision.
- Compute-constrained inference: CNNs are still cheaper than ViTs at small resolutions.
- As a feature extractor: a ResNet backbone before a transformer head is still a competitive recipe.
For “general” image understanding, Vision Transformers (ViTs) and hybrid models have largely replaced pure CNNs at the frontier.
Recurrent Neural Networks (RNNs)
A vanilla RNN processes a sequence one step at a time, carrying a hidden state:
h_t = tanh(W_h h_{t-1} + W_x x_t + b)
y_t = W_y h_t
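Those two equations as code, unrolled over a toy sequence (all sizes are made up):
import torch
d_in, d_hidden, d_out = 10, 20, 4
W_x = torch.randn(d_hidden, d_in) * 0.1
W_h = torch.randn(d_hidden, d_hidden) * 0.1
b = torch.zeros(d_hidden)
W_y = torch.randn(d_out, d_hidden) * 0.1
xs = torch.randn(5, d_in)  # a length-5 input sequence
h = torch.zeros(d_hidden)  # initial hidden state
for x_t in xs:             # strictly sequential: step t needs h from step t-1
    h = torch.tanh(W_h @ h + W_x @ x_t + b)  # h_t = tanh(W_h h_{t-1} + W_x x_t + b)
    y_t = W_y @ h                            # y_t = W_y h_t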
Properties:
- Variable-length input.
- Sequential — can’t parallelize over time.
- Memory: hidden state carries information forward.
- Vanishing/exploding gradients through time make long-range dependencies hard.
LSTMs
Long Short-Term Memory networks add gating:
f = σ(W_f [x, h]) # forget gate
i = σ(W_i [x, h]) # input gate
g = tanh(W_g [x, h]) # candidate cell state
c = f ⊙ c_prev + i ⊙ g
o = σ(W_o [x, h]) # output gate
h = o ⊙ tanh(c)
The cell state c flows through with multiplicative gating, mitigating vanishing gradients. LSTMs were the dominant sequence architecture from ~2014 to 2017.
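The same update written out with tensor ops, one step, biases omitted as in the equations above (sizes are illustrative):
import torch
d_in, d_hidden = 8, 16
W_f, W_i, W_g, W_o = (torch.randn(d_hidden, d_in + d_hidden) * 0.1 for _ in range(4))
x = torch.randn(d_in)           # current input x_t
h = torch.zeros(d_hidden)       # previous hidden state
c_prev = torch.zeros(d_hidden)  # previous cell state
xh = torch.cat([x, h])          # the [x, h] from the equations
f = torch.sigmoid(W_f @ xh)     # forget gate: how much of c_prev to keep
i = torch.sigmoid(W_i @ xh)     # input gate: how much new content to write
g = torch.tanh(W_g @ xh)        # candidate cell state
c = f * c_prev + i * g          # cell state update (elementwise products)
o = torch.sigmoid(W_o @ xh)     # output gate
h = o * torch.tanh(c)           # new hidden state
In practice you'd use nn.LSTM or nn.LSTMCell, which add biases and run over whole (batched) sequences.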
GRUs
Gated Recurrent Units: a simpler variant that folds the forget and input gates into a single update gate and drops the separate cell state. Often competitive with LSTMs despite fewer parameters.
Bidirectional RNNs
Run two RNNs, one forward and one backward; concatenate states. Useful when you have the full sequence at training time (e.g. sequence labeling). Not applicable for autoregressive generation.
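In PyTorch this is a constructor flag; the concatenation shows up as a doubled feature dimension (sizes are illustrative):
import torch
import torch.nn as nn
rnn = nn.LSTM(input_size=10, hidden_size=20, bidirectional=True, batch_first=True)
x = torch.randn(4, 50, 10)  # batch of 4 sequences, length 50
out, _ = rnn(x)
print(out.shape)            # (4, 50, 40): forward and backward hidden states concatenated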
Sequence-to-sequence (seq2seq)
Encoder RNN reads input, produces final hidden state; decoder RNN starts from that state and generates output. The original neural machine translation architecture (2014).
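A skeletal GRU-based encoder-decoder in that spirit, without attention (names and sizes are mine, for illustration only):
import torch
import torch.nn as nn
class Seq2Seq(nn.Module):
    """Encoder reads the source; its final hidden state initializes the decoder."""
    def __init__(self, src_vocab, tgt_vocab, d_model=256):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, d_model)
        self.tgt_emb = nn.Embedding(tgt_vocab, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.decoder = nn.GRU(d_model, d_model, batch_first=True)
        self.out = nn.Linear(d_model, tgt_vocab)
    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_emb(src_ids))           # h: the encoder's final hidden state
        dec_out, _ = self.decoder(self.tgt_emb(tgt_ids), h)  # decoder starts from that state
        return self.out(dec_out)                             # logits over the target vocabulary
model = Seq2Seq(src_vocab=1000, tgt_vocab=1000)
logits = model(torch.randint(0, 1000, (2, 12)), torch.randint(0, 1000, (2, 9)))
print(logits.shape)  # (2, 9, 1000)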
Bahdanau attention (2015) extended seq2seq with an attention mechanism — letting the decoder selectively look at encoder states. This was the seed of the transformer.
Why transformers replaced RNNs
- Parallelism. Transformers process the whole sequence at once. RNNs are step-by-step.
- Long-range dependencies. Direct attention reaches anywhere in the sequence; RNNs have to pass info through every intermediate step.
- Scaling. Transformers scale beautifully with compute and data; RNNs run into optimization and throughput walls.
But RNNs aren’t dead in 2026:
- Mamba and state-space models (SSMs) revive recurrent ideas with parallelizable training.
- xLSTM and modernized RNN variants compete on long context.
- RWKV combines RNN-like inference with transformer-like training.
For most language tasks, transformers still win — but the recurrent thread is alive.
CNNs and RNNs as building blocks today
In modern systems they appear:
- Audio: convolutional front-ends in Whisper-style ASR.
- Vision: convolutional patch embedders feed Vision Transformers.
- Diffusion: U-Nets are convolutional; DiT (Diffusion Transformer) is replacing them at scale.
- Tokenization for images/video: VAEs with conv encoders compress to latent grids.
You don't need to be deep on CNNs/RNNs to build LLM apps, but you'll meet them (and sometimes beat them) across multimodal stacks.
Exercises
- CNN on MNIST: build a 2-conv-layer CNN. Hit >99% test accuracy.
- Receptive field: compute the receptive field of a 3-layer CNN with kernel size 3 and stride 1. Now with stride 2.
- Char-level RNN: train a 1-layer LSTM to generate text from a small corpus. Notice it forgets context past ~50 chars.
- Compare to a tiny transformer: train a 2-layer transformer on the same task. Compare quality.
See also
- Stage 04 — Why transformers — why we mostly moved past RNNs
- Stage 06 — Transformers
- Stage 12 — Vision-language models