Architectures: CNNs and RNNs

Before transformers, two architectures dominated. Even today, they’re not gone — CNNs still rule low-level vision; RNNs survive in speech, control, and as building blocks for hybrid models.

Convolutional Neural Networks (CNNs)

A convolutional layer applies a small filter (kernel) across spatial positions of the input, sharing weights:

output[i,j] = Σ_{u,v} kernel[u,v] · input[i+u, j+v] + b

Properties:

  • Translation equivariance: a filter detecting an edge works at every position.
  • Local connectivity: each output position only depends on a small neighborhood.
  • Parameter sharing: one kernel, used everywhere — far fewer parameters than an MLP.

In PyTorch:

nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
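
Applying it to a dummy batch confirms the shape bookkeeping (a minimal sketch; the input size is arbitrary):

import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)   # one 32x32 RGB image
y = conv(x)
print(y.shape)                  # torch.Size([1, 64, 32, 32]): padding=1 preserves height and width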

Stride, padding, dilation

  • Stride: how far the kernel moves between applications. Stride > 1 downsamples.
  • Padding: zeros around the input edges. Lets the kernel “see” boundary pixels.
  • Dilation: skip pixels in the kernel — larger receptive field without more parameters.
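
These three parameters combine in the standard output-size formula; a small sketch (the helper function is mine), cross-checked against PyTorch:

import torch
import torch.nn as nn

def conv_out_size(n, kernel_size, stride=1, padding=0, dilation=1):
    # Output length along one spatial axis of a convolution.
    return (n + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1

print(conv_out_size(32, 3, stride=1, padding=1))   # 32: "same"-size output
print(conv_out_size(32, 3, stride=2, padding=1))   # 16: stride 2 halves the resolution
print(conv_out_size(32, 3, dilation=2))            # 28: dilation widens the kernel's footprint

x = torch.randn(1, 3, 32, 32)
print(nn.Conv2d(3, 8, kernel_size=3, stride=2, padding=1)(x).shape)   # torch.Size([1, 8, 16, 16])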

Pooling

Aggregate local activations to downsample:

  • Max pool: the max in each window.
  • Average pool: the mean.
  • Adaptive pooling: collapse to a fixed output size.

Modern CNNs sometimes skip pooling, using strided convolutions instead.
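
A quick comparison of the two downsampling routes (channel count and input size here are arbitrary):

import torch
import torch.nn as nn

x = torch.randn(1, 64, 56, 56)

# Both halve the spatial resolution; the strided conv has learnable weights, the pool does not.
print(nn.MaxPool2d(kernel_size=2, stride=2)(x).shape)                    # torch.Size([1, 64, 28, 28])
print(nn.Conv2d(64, 64, kernel_size=3, stride=2, padding=1)(x).shape)    # torch.Size([1, 64, 28, 28])

# Adaptive pooling collapses any spatial size to a fixed grid, common right before a classifier head.
print(nn.AdaptiveAvgPool2d(1)(x).shape)                                  # torch.Size([1, 64, 1, 1])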

Classic architectures

  • LeNet (1998): handwritten digits. The blueprint.
  • AlexNet (2012): ImageNet breakthrough.
  • VGG (2014): deep stack of 3×3 convs.
  • ResNet (2015): residual connections enable 100+ layers.
  • EfficientNet (2019): principled width/depth/resolution scaling.
  • ConvNeXt (2022): modern conv design competitive with transformers.

Residual connections

output = F(x) + x

Add the input to the output of a sub-block. Gradients flow through the addition unchanged, which mitigates vanishing gradients in deep networks. Now ubiquitous, including in transformers.
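
A minimal residual block sketch (the class name and inner layers are illustrative; real ResNet blocks differ in detail):

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return torch.relu(self.body(x) + x)   # F(x) + x: the skip connection

x = torch.randn(1, 64, 28, 28)
print(ResidualBlock(64)(x).shape)              # torch.Size([1, 64, 28, 28])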

When to use CNNs in 2026

  • Local pattern problems: medical imaging, microscopy, low-level vision.
  • Compute-constrained inference: CNNs are still cheaper than ViTs at small resolutions.
  • As a feature extractor: a ResNet backbone before a transformer head is still a competitive recipe.

For “general” image understanding, Vision Transformers (ViTs) and hybrid models have largely replaced pure CNNs at the frontier.

Recurrent Neural Networks (RNNs)

A vanilla RNN processes a sequence one step at a time, carrying a hidden state:

h_t = tanh(W_h h_{t-1} + W_x x_t + b)
y_t = W_y h_t
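
A direct translation of those two equations, with toy sizes (weight names follow the formulas above; in practice you would use nn.RNN):

import torch

def rnn_step(x_t, h_prev, W_x, W_h, W_y, b):
    h_t = torch.tanh(h_prev @ W_h + x_t @ W_x + b)   # new hidden state
    y_t = h_t @ W_y                                  # per-step output
    return h_t, y_t

# Toy sizes: input 8, hidden 16, output 4.
W_x, W_h, W_y = torch.randn(8, 16), torch.randn(16, 16), torch.randn(16, 4)
b, h = torch.zeros(16), torch.zeros(1, 16)
for x_t in torch.randn(5, 1, 8):   # a length-5 sequence, necessarily processed in order
    h, y = rnn_step(x_t, h, W_x, W_h, W_y, b)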

Properties:

  • Variable-length input.
  • Sequential — can’t parallelize over time.
  • Memory: hidden state carries information forward.
  • Vanishing/exploding gradients through time make long-range dependencies hard.

LSTMs

Long Short-Term Memory networks add gating:

f = σ(W_f [x, h])     # forget gate
i = σ(W_i [x, h])     # input gate
g = tanh(W_g [x, h])  # candidate cell update
c = f ⊙ c_prev + i ⊙ g
o = σ(W_o [x, h])     # output gate
h = o ⊙ tanh(c)

The cell state c flows through with multiplicative gating, mitigating vanishing gradients. LSTMs were the dominant sequence architecture from ~2014 to 2017.
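
The same gating written out explicitly (a sketch with concatenated inputs and toy sizes; in practice nn.LSTM fuses these matmuls):

import torch

def lstm_step(x, h, c, W_f, W_i, W_g, W_o, b_f, b_i, b_g, b_o):
    xh = torch.cat([x, h], dim=-1)            # [x, h]: concatenate input and hidden state
    f = torch.sigmoid(xh @ W_f + b_f)         # forget gate
    i = torch.sigmoid(xh @ W_i + b_i)         # input gate
    g = torch.tanh(xh @ W_g + b_g)            # candidate cell update
    c = f * c + i * g                         # gated carry of the old cell + gated write of the new
    o = torch.sigmoid(xh @ W_o + b_o)         # output gate
    h = o * torch.tanh(c)
    return h, c

# Toy sizes: input 8, hidden 16, so each W_* is (8 + 16) x 16.
Ws = [torch.randn(24, 16) for _ in range(4)]
bs = [torch.zeros(16) for _ in range(4)]
h = c = torch.zeros(1, 16)
h, c = lstm_step(torch.randn(1, 8), h, c, *Ws, *bs)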

GRUs

Gated Recurrent Units are a simpler variant with two gates (update and reset) and no separate cell state. Often competitive with LSTMs.
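
In PyTorch the swap is one class name (sizes here are arbitrary):

import torch
import torch.nn as nn

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
out, h_n = gru(torch.randn(2, 10, 8))   # (batch=2, seq_len=10, features=8)
print(out.shape, h_n.shape)             # torch.Size([2, 10, 16]) torch.Size([1, 2, 16])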

Bidirectional RNNs

Run two RNNs, one forward and one backward; concatenate states. Useful when you have the full sequence at training time (e.g. sequence labeling). Not applicable for autoregressive generation.
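
In PyTorch this is a single flag; note the doubled feature dimension (a small sketch with arbitrary sizes):

import torch
import torch.nn as nn

birnn = nn.LSTM(input_size=8, hidden_size=16, batch_first=True, bidirectional=True)
out, _ = birnn(torch.randn(2, 10, 8))
print(out.shape)   # torch.Size([2, 10, 32]): forward and backward states concatenated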

Sequence-to-sequence (seq2seq)

Encoder RNN reads input, produces final hidden state; decoder RNN starts from that state and generates output. The original neural machine translation architecture (2014).
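
The handoff in code, with GRUs and pre-embedded toy tensors (a sketch; a real system adds embedding layers, an output projection, and a loss):

import torch
import torch.nn as nn

enc = nn.GRU(input_size=8, hidden_size=32, batch_first=True)
dec = nn.GRU(input_size=8, hidden_size=32, batch_first=True)

src = torch.randn(2, 12, 8)     # source sequence (already embedded)
_, h_enc = enc(src)             # final encoder hidden state: (1, 2, 32)

tgt = torch.randn(2, 7, 8)      # target sequence (teacher forcing during training)
dec_out, _ = dec(tgt, h_enc)    # the decoder starts from the encoder's final state
print(dec_out.shape)            # torch.Size([2, 7, 32])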

Bahdanau attention (2015) extended seq2seq with an attention mechanism — letting the decoder selectively look at encoder states. This was the seed of the transformer.
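
A compact sketch of additive (Bahdanau-style) scoring; the layer names and sizes are mine, not the paper's:

import torch
import torch.nn as nn

enc_states = torch.randn(2, 12, 32)     # (batch, src_len, hidden): all encoder states
dec_state = torch.randn(2, 32)          # current decoder hidden state

W_enc, W_dec, v = nn.Linear(32, 32), nn.Linear(32, 32), nn.Linear(32, 1)

scores = v(torch.tanh(W_enc(enc_states) + W_dec(dec_state).unsqueeze(1))).squeeze(-1)   # (2, 12)
weights = torch.softmax(scores, dim=-1)                  # how much to look at each source position
context = (weights.unsqueeze(-1) * enc_states).sum(1)    # (2, 32): weighted sum of encoder states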

Why transformers replaced RNNs

  1. Parallelism. Transformers process the whole sequence at once. RNNs are step-by-step.
  2. Long-range dependencies. Direct attention reaches anywhere in the sequence; RNNs have to pass info through every intermediate step.
  3. Scaling. Transformers scale beautifully with compute and data; RNNs hit walls.

But RNNs aren’t dead in 2026:

  • Mamba and state-space models (SSMs) revive recurrent ideas with parallelizable training.
  • xLSTM and modernized RNN variants compete on long context.
  • RWKV combines RNN-like inference with transformer-like training.

For most language tasks, transformers still win — but the recurrent thread is alive.

CNNs and RNNs as building blocks today

In modern systems they appear:

  • Audio: convolutional front-ends in Whisper-style ASR.
  • Vision: convolutional patch embedders feed Vision Transformers.
  • Diffusion: U-Nets are convolutional; DiT (Diffusion Transformer) is replacing them at scale.
  • Tokenization for images/video: VAEs with conv encoders compress to latent grids.

You don’t need to be deep on CNNs/RNNs to build LLM apps, but you’ll keep meeting them across multimodal stacks.

Exercises

  1. CNN on MNIST: build a 2-conv-layer CNN. Hit >99% test accuracy.
  2. Receptive field: compute the receptive field of a 3-layer CNN with kernel size 3 and stride 1. Now with stride 2.
  3. Char-level RNN: train a 1-layer LSTM to generate text from a small corpus. Notice it forgets context past ~50 chars.
  4. Compare to a tiny transformer: train a 2-layer transformer on the same task. Compare quality.

See also