Why Transformers Won

The 2017 paper “Attention Is All You Need” (Vaswani et al.) didn’t propose a new task. It proposed an architecture that solved the same sequence-transduction problems RNNs did, better and faster, by replacing recurrence with self-attention. Within five years, essentially every major language model was a transformer. Why?

The fundamental shift

RNNs encode the sequence step-by-step into a hidden state. Each step has access to a compressed summary of everything before. Long-range information must survive many gates and many gradient passes.

Transformers throw out the sequential update entirely. At each layer, every token directly attends to every other token. Distance in the sequence stops mattering for information flow — only attention weight matters.

RNN: each token → next via hidden state
Transformer: every token ↔ every other token, in parallel
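
A minimal NumPy sketch of the contrast (toy sizes, random weights, no training; purely to show the shape of the computation):

  import numpy as np

  T, d = 6, 8                            # toy sequence length and hidden size
  x = np.random.randn(T, d)              # token representations

  # RNN: a loop over time. Token 0 reaches token T-1 only through the chain of h's.
  W_h = np.random.randn(d, d) * 0.1
  W_x = np.random.randn(d, d) * 0.1
  h = np.zeros(d)
  for t in range(T):
      h = np.tanh(h @ W_h + x[t] @ W_x)

  # Self-attention: one shot. Every position scores every other position directly.
  W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
  Q, K, V = x @ W_q, x @ W_k, x @ W_v
  scores = Q @ K.T / np.sqrt(d)                         # (T, T) pairwise scores
  scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
  weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
  out = weights @ V                                     # every row mixes all positions at once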

Five reasons transformers won

1. Parallelism

In a transformer, the forward pass on a sequence of length T is essentially a sequence of matrix multiplications. All T positions are processed simultaneously. GPUs love this.

In an RNN, position t+1 depends on position t. You can batch across sequences but not across timesteps within a sequence. GPUs sit idle waiting for the next step.

For training, this gives transformers ~10–100× higher throughput on the same hardware.
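
A hedged PyTorch sketch of the same point at the module level (the sizes are arbitrary; what matters is which half needs a Python loop over time):

  import torch
  import torch.nn as nn

  B, T, d = 32, 1024, 512
  x = torch.randn(B, T, d)

  # Transformer-style: one call processes all T positions of every sequence at once.
  attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)
  y, _ = attn(x, x, x)                    # (B, T, d), no loop over time

  # RNN-style: the recurrence forces a step-by-step walk over the T timesteps.
  cell = nn.LSTMCell(d, d)
  h = torch.zeros(B, d)
  c = torch.zeros(B, d)
  outputs = []
  for t in range(T):                      # T sequential dependencies per sequence
      h, c = cell(x[:, t], (h, c))
      outputs.append(h)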

2. Long-range dependencies

In an RNN, information from token 1 reaches token 100 by being repeatedly transformed and gated through 99 hidden state updates. Each pass is a chance for the signal to attenuate.

In a transformer, token 100 directly attends to token 1 in a single attention operation. The gradient flow is just as direct.

This is why transformers handle long-range dependencies (cross-paragraph references, code dependencies, long conversations) much better than RNNs.
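
A tiny autograd check of the “direct path” claim, under the assumption that toy sizes and random weights are enough to show the structure (the exact numbers mean nothing):

  import torch

  T, d = 100, 16
  x = torch.randn(T, d, requires_grad=True)

  # RNN path: token 0's influence on step 99 passes through 99 gated updates.
  W = torch.randn(d, d) * 0.1
  h = torch.zeros(d)
  for t in range(T):
      h = torch.tanh(h @ W + x[t])
  h.sum().backward()
  print("RNN grad norm at token 0:      ", x.grad[0].norm().item())

  x.grad = None

  # Attention path: position 99 reaches position 0 in a single weighted sum.
  weights = torch.softmax((x @ x.T) / d ** 0.5, dim=-1)
  out = weights @ x
  out[99].sum().backward()
  print("Attention grad norm at token 0:", x.grad[0].norm().item())

With the contractive weights used here, the RNN gradient at token 0 has been squeezed through 99 tanh-and-matmul steps; the attention gradient reaches it in one hop.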

3. Scaling laws

Transformers respond to scale predictably. As you increase parameters, data, and compute together, loss (and with it perplexity) falls along smooth power-law curves (Kaplan et al. 2020, Hoffmann et al. 2022).

RNNs scale, but with diminishing returns and worse stability. Train a 70B-parameter LSTM and you get something worse than a 7B-parameter transformer.

This predictability is what unlocks foundation-model economics: you can decide ahead of time, “I want a model with X capability — I need to spend Y on compute.”
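
That functional form can be written down directly. A schematic Python version of the Chinchilla-style law, where the constants are placeholders for illustration rather than the fitted values from Hoffmann et al.:

  # Loss as a power law in parameters N and training tokens D:
  #   L(N, D) = E + A / N**alpha + B / D**beta
  def scaling_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
      return E + A / N ** alpha + B / D ** beta

  print(scaling_loss(N=7e9,  D=140e9))    # a "7B params, 140B tokens" point
  print(scaling_loss(N=70e9, D=1.4e12))   # 100x the compute: lower loss, on schedule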

4. Inductive bias

A transformer’s inductive bias is “every position can talk to every position, weighted by content.” This is more flexible than an RNN’s “information flows forward through a hidden state” — and turns out to match natural language better.

Code references variables defined thousands of tokens earlier. Stories reference characters introduced in chapter 1. Conversations reference earlier turns. Attention handles all of these directly.

5. Architectural simplicity

A transformer block is:

  1. LayerNorm
  2. Multi-head self-attention
  3. Residual add
  4. LayerNorm
  5. Feed-forward MLP
  6. Residual add

That’s it. The same block, stacked N times. No gates, no recurrent state, no special handling.
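
A minimal pre-norm version of that block in PyTorch, offered as a sketch rather than a reference implementation (the 4x MLP expansion and the default sizes are common conventions, not requirements):

  import torch.nn as nn

  class Block(nn.Module):
      def __init__(self, d_model=512, n_heads=8):
          super().__init__()
          self.ln1 = nn.LayerNorm(d_model)                        # step 1
          self.attn = nn.MultiheadAttention(d_model, n_heads,
                                            batch_first=True)     # step 2
          self.ln2 = nn.LayerNorm(d_model)                        # step 4
          self.mlp = nn.Sequential(                               # step 5
              nn.Linear(d_model, 4 * d_model),
              nn.GELU(),
              nn.Linear(4 * d_model, d_model),
          )

      def forward(self, x):                   # x: (batch, seq, d_model)
          h = self.ln1(x)
          a, _ = self.attn(h, h, h)
          x = x + a                           # step 3: residual add
          x = x + self.mlp(self.ln2(x))       # step 6: residual add
          return x

  # The whole architecture story is stacking: the same block, N times.
  # model = nn.Sequential(*[Block() for _ in range(12)])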

This simplicity made transformers easy to optimize (kernel-level), to scale (parallel and distributed training), and to extend (multimodal — same block works on patches/audio/text).

What about quadratic complexity?

Self-attention is O(T²) in sequence length. For T = 1M tokens, the attention matrix has 10^12 entries. Concerns about this have driven years of “linear attention” research.
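
A back-of-the-envelope check of what naively materializing that matrix would mean, per attention head and per layer, ignoring the optimizations discussed below:

  T = 1_000_000                       # 1M-token context
  entries = T * T                     # 1e12 attention scores
  fp16_bytes = entries * 2            # 2 bytes per score in fp16
  print(f"{entries:.1e} entries  ~=  {fp16_bytes / 1e12:.0f} TB in fp16")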

In practice:

  • For most applications T < 100k → quadratic is fine on modern hardware.
  • For longer contexts, hardware-aware optimizations (FlashAttention, paged attention) avoid materializing the full matrix, so the practical cost stays manageable even though the arithmetic is still O(T²).
  • Hybrid architectures (e.g. Jamba, which interleaves attention with Mamba-style state-space layers) handle very long contexts with mostly linear-time mechanisms.

The quadratic cost has been more solvable than the RNN bottleneck of “can’t parallelize over time.”

The role of pretraining

Transformers became dominant through the pretraining + fine-tuning paradigm:

  1. Pretrain on massive unlabeled corpora with a self-supervised objective (next-token prediction or masked-token prediction).
  2. Fine-tune or prompt for downstream tasks.

This works for transformers because:

  • They scale.
  • Self-attention is a flexible enough mechanism to learn many tasks.
  • Pretraining produces general-purpose features.

Pretraining also worked for RNN language models (ELMo is the classic example), but the transformer’s scaling efficiency made it the platform for foundation models.
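
To make step 1 of the recipe concrete, here is a hedged sketch of the next-token-prediction loss (PyTorch; the random logits tensor stands in for the output of an actual transformer stack, which is omitted):

  import torch
  import torch.nn.functional as F

  vocab, B, T = 50_000, 8, 128
  tokens = torch.randint(0, vocab, (B, T))     # a batch of token ids from the corpus
  logits = torch.randn(B, T, vocab)            # stand-in for the model's output

  # Predict token t+1 from everything up to t: shift predictions and targets by one.
  loss = F.cross_entropy(
      logits[:, :-1].reshape(-1, vocab),       # predictions at positions 0..T-2
      tokens[:, 1:].reshape(-1),               # targets: the next token at each position
  )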

What transformers don’t do well (yet)

  • Truly long contexts (≥10M tokens) — quadratic cost or accuracy degradation.
  • Continuous online learning — they’re typically frozen at deployment.
  • Real-time generation at very low latency — autoregressive decoding is sequential at inference time.
  • Strong systematic generalization to genuinely novel compositions (open research question).

These gaps are being attacked through architecture (Mamba, RWKV), training (better data mixes, RLHF, reasoning tuning), and inference (KV caching, speculative decoding).
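
Of the inference-side fixes, KV caching is the easiest to sketch. A single-head, unbatched toy version with random weights; the point is only that each decode step reuses all previously computed keys and values instead of recomputing the prefix:

  import torch

  d = 64
  W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
  K_cache, V_cache = [], []

  def decode_step(x_t):
      # New token: compute its query, append its key/value to the cache,
      # then attend over everything cached so far. O(t) work per step.
      q = x_t @ W_q
      K_cache.append(x_t @ W_k)
      V_cache.append(x_t @ W_v)
      K, V = torch.stack(K_cache), torch.stack(V_cache)
      w = torch.softmax((K @ q) / d ** 0.5, dim=0)
      return w @ V

  for _ in range(10):                       # decoding is still sequential,
      y = decode_step(torch.randn(d))       # but no prefix work is repeated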

The takeaway

Transformers won not because attention is magical but because:

  • They train fast (parallel).
  • They reach far (no recurrent bottleneck).
  • They scale predictably (scaling laws).
  • They’re simple (one block, replicated).
  • They benefited massively from self-supervised pretraining at scale.

Every modern AI model — chat, code, image, video, audio — is built on this architecture or something very close to it. Stage 06 unpacks how it works.

See also