step 07 · build
The transformer block
LayerNorm → MultiHead → residual → LayerNorm → MLP → residual. The pattern stacked N times to make a deep transformer.
A transformer block is two operations stacked: multi-head attention and a feed-forward MLP, with layer norm before each and residual connections wrapping each. That’s it. The whole architecture is this block repeated N times, sandwiched between an embedding layer (step 04) and an output projection (step 08).
By the end of this step you’ll have a Block module that takes a (B, T, d_model) tensor and returns a (B, T, d_model) tensor — same shape contract as everything else. About 60 lines of code. We’ll also resolve three small mysteries on the way:
- Why pre-norm instead of the post-norm in the original “Attention Is All You Need” paper
- Why the MLP expands to 4× the model dimension
- Why GELU replaces ReLU between the two MLP linears
What we’re building
The block, in pseudocode:
def block(x):
    x = x + attn(layer_norm(x))   # attention sub-layer
    x = x + mlp(layer_norm(x))    # MLP sub-layer
    return x
Two sub-layers, each shaped the same: norm, then transform, then add to the running residual stream.
Calling this the “residual stream” is more than aesthetic — it’s the load-bearing concept. The vector at position t after N blocks is just the sum of N+1 contributions: the original embedding, plus what each block decided to add. Every block reads from the residual stream (via norm), produces an update (attention or MLP), and writes that update back. Think of it as a noticeboard: each block pins up a new note without taking down the older ones.
This insight matters when you read papers about induction heads, feature circuits, or anything else from the interpretability literature. They all describe layers as small contributions to a shared residual stream, exactly like the architecture we’re building here.
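To make that concrete, here is a small sketch you can run once the Block class below exists — the sizes and the 4-block stack are arbitrary, purely for illustration. It records what each block writes to the stream and checks that the final output is the embedding plus the sum of those updates:

import torch
from tiny_llm.block import Block

blocks = [Block(d_model=64, n_heads=8, max_seq_len=16) for _ in range(4)]

x = torch.randn(2, 12, 64)            # stand-in for the embedding output
stream = x
updates = []
for blk in blocks:
    before = stream
    stream = blk(stream)
    updates.append(stream - before)   # what this block wrote to the stream

# Final stream = original embedding + every block's contribution.
print(torch.allclose(stream, x + sum(updates)))   # True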
Pre-norm vs post-norm
The original transformer paper put layer norm after each sub-layer, inside the residual:
# post-norm: original "Attention Is All You Need"
x = layer_norm(x + attn(x))
x = layer_norm(x + mlp(x))
Modern decoder-only LLMs (GPT-2 onward, LLaMA, Qwen, Mixtral, all of them) put layer norm before the sub-layer, outside the residual:
# pre-norm: GPT-2 onwards
x = x + attn(layer_norm(x))
x = x + mlp(layer_norm(x))
Why the change? Because pre-norm trains stably at depth and post-norm doesn’t.
In post-norm, the residual stream gets renormalized after every block — so the magnitudes downstream don’t grow. That sounds good, but it means gradients from the loss have to flow back through N LayerNorms in sequence, and LayerNorm’s gradient depends on the statistics of its input. At depth ≥ 12 with random init, this becomes unstable. You either need careful learning-rate warm-up and carefully tuned init constants, or you switch to pre-norm.
Pre-norm puts LayerNorm inside the sub-layer, leaving the residual path unobstructed — gradients flow back through the sum operation directly to early layers, regardless of depth. Pre-norm models train at 100+ layers without special tricks; post-norm models don’t.
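If you want to poke at this yourself, here is a crude toy experiment — not part of tiny_llm, and plain nn.Linear layers stand in for the attention/MLP sub-layers. It builds both arrangements at several depths and prints how much gradient reaches the input; exact numbers depend on init and depth, so treat it as an illustration rather than a benchmark:

import torch
import torch.nn as nn

def input_grad_norm(depth: int, pre_norm: bool) -> float:
    torch.manual_seed(0)
    d = 64
    norms = nn.ModuleList(nn.LayerNorm(d) for _ in range(depth))
    subs = nn.ModuleList(nn.Linear(d, d) for _ in range(depth))   # stand-in sub-layers
    readout = torch.randn(d)                  # fixed random readout direction
    x = torch.randn(8, d, requires_grad=True)
    h = x
    for ln, f in zip(norms, subs):
        h = h + f(ln(h)) if pre_norm else ln(h + f(h))
    (h @ readout).sum().backward()
    return x.grad.norm().item()

for depth in (4, 16, 64):
    print(depth, input_grad_norm(depth, True), input_grad_norm(depth, False))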
If you’ve used the Layer Norm demo on this site, you’ve seen what LayerNorm does to a distribution — center, normalize, rescale via learned γ and β. The pre-norm pattern means we apply this normalization to the input of each sub-layer, not the output.
The MLP
The feed-forward sub-layer that follows attention in each block:
def mlp(x):
    h = linear(d_model → 4·d_model)(x)
    h = gelu(h)
    return linear(4·d_model → d_model)(h)
Three things baked into the convention.
The 4× expansion factor is empirical. The original transformer paper used d_ff = 4 · d_model; subsequent ablations showed it sits in a sweet spot. Smaller (2×) underfits; larger (8×) doesn’t help in proportion to its parameter cost. Most modern models stick with 4× even though some (Llama 3) use slightly different ratios with SwiGLU activations.
GELU instead of ReLU. ReLU has a sharp kink at zero (gradient jumps from 0 to 1). GELU is a smooth approximation that matches ReLU asymptotically but is differentiable everywhere — it tends to give slightly better convergence in transformer-shaped models. GPT and BERT popularized GELU in this position; almost everyone followed.
Two linears, no biases. The hidden activation is computed by linear(d → 4d) followed by GELU, then projected back to d by linear(4d → d). We drop the biases on both layers, as most modern decoder-only models do — they add parameters without measurably helping, and the learned shifts in the surrounding layer norms cover most of what a bias could do.
In parameter-count terms, the MLP is the majority of the parameters in a transformer. With d_model=128, d_ff=512, the MLP has 128·512 + 512·128 = 131,072 params per block. The attention sub-layer has 4·128·128 = 65,536. MLP is 2× attention’s params at every block; for a 12-block model, that’s where the parameters live.
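If you want to play with d_model and the ratio, the same arithmetic as a few lines of Python (weights only — no biases or layer norms, and taking the attention sub-layer as four d_model × d_model matrices as in step 06):

d_model, mlp_ratio = 128, 4
d_ff = mlp_ratio * d_model

mlp_params = d_model * d_ff + d_ff * d_model   # fc1 + fc2
attn_params = 4 * d_model * d_model            # W_q, W_k, W_v, W_o

print(mlp_params, attn_params, mlp_params / attn_params)   # 131072 65536 2.0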
Setup
Open a new file:
# tiny_llm/block.py
import torch
import torch.nn as nn
from tiny_llm.mha import MultiHeadAttention
We import MultiHeadAttention from step 06 — it’s a drop-in component now.
The MLP module
Pull the MLP out into its own class — it’s reused later:
# tiny_llm/block.py

class MLP(nn.Module):
    """Two-layer feed-forward network with GELU activation.

    d_model → 4*d_model → d_model.
    """

    def __init__(self, d_model: int, mlp_ratio: int = 4) -> None:
        super().__init__()
        d_ff = mlp_ratio * d_model
        self.fc1 = nn.Linear(d_model, d_ff, bias=False)
        self.fc2 = nn.Linear(d_ff, d_model, bias=False)

        # Same standard init we used in mha.py and embed.py.
        nn.init.normal_(self.fc1.weight, mean=0.0, std=0.02)
        nn.init.normal_(self.fc2.weight, mean=0.0, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(nn.functional.gelu(self.fc1(x)))
Eleven lines. Not much to it.
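If you want to poke at it in isolation, a quick shape check (a throwaway snippet, not part of block.py):

import torch
from tiny_llm.block import MLP

mlp = MLP(d_model=64)
x = torch.randn(2, 12, 64)
print(mlp(x).shape)   # torch.Size([2, 12, 64])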
The Block class
# tiny_llm/block.py

class Block(nn.Module):
    """One transformer block: pre-norm + attention + residual,
    then pre-norm + MLP + residual.

    Input:  (B, T, d_model)
    Output: (B, T, d_model)
    """

    def __init__(
        self,
        d_model: int,
        n_heads: int,
        max_seq_len: int,
        mlp_ratio: int = 4,
    ) -> None:
        super().__init__()
        # Two layer norms — one per sub-layer. Each has its own
        # learned γ (scale) and β (shift) parameters.
        self.ln_1 = nn.LayerNorm(d_model)
        self.ln_2 = nn.LayerNorm(d_model)
        self.attn = MultiHeadAttention(
            d_model=d_model,
            n_heads=n_heads,
            max_seq_len=max_seq_len,
        )
        self.mlp = MLP(d_model=d_model, mlp_ratio=mlp_ratio)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: attention with pre-norm + residual.
        x = x + self.attn(self.ln_1(x))
        # Sub-layer 2: MLP with pre-norm + residual.
        x = x + self.mlp(self.ln_2(x))
        return x
That’s the whole block. Twenty lines including the comments.
The forward pass is two lines of math. Each line is “norm, transform, add.” The compactness here is real: every transformer-decoder model on earth — GPT-2, GPT-3, GPT-4 (rumored), LLaMA, Qwen, Mixtral, DeepSeek — uses this exact two-line pattern. Modern variants swap LayerNorm for RMSNorm, GELU for SwiGLU, learned positions for RoPE, but the residual stream + pre-norm + (attn, MLP) skeleton is universal.
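Because the shape contract holds, stacking really is just a loop. A sketch of what step 08 will do with this (the names n_layers and blocks here are illustrative, not the final GPT class):

import torch
import torch.nn as nn
from tiny_llm.block import Block

n_layers = 4   # illustrative; step 08 picks the real value
blocks = nn.ModuleList(
    Block(d_model=64, n_heads=8, max_seq_len=16) for _ in range(n_layers)
)

x = torch.randn(2, 12, 64)
for blk in blocks:
    x = blk(x)            # shape stays (B, T, d_model) at every layer
print(tuple(x.shape))     # (2, 12, 64)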
Sanity check
Add at the bottom:
# tiny_llm/block.py

if __name__ == "__main__":
    torch.manual_seed(0)

    block = Block(d_model=64, n_heads=8, max_seq_len=16)
    print(f"params: {sum(p.numel() for p in block.parameters()):,}")
    print(f" attn: {sum(p.numel() for p in block.attn.parameters()):,}")
    print(f" mlp: {sum(p.numel() for p in block.mlp.parameters()):,}")
    print(f" ln_1+2: {sum(p.numel() for p in list(block.ln_1.parameters()) + list(block.ln_2.parameters())):,}")

    x = torch.randn(2, 12, 64)
    out = block(x)
    print(f"\ninput shape: {tuple(x.shape)}")
    print(f"output shape: {tuple(out.shape)}")
    print(f"shapes match: {x.shape == out.shape}")

    # Residual sanity: with the block initialized to small weights, the
    # output should be close to the input (the residual path dominates).
    delta = (out - x).std() / x.std()
    print(f"\nrelative change ‖out − x‖ / ‖x‖: {delta:.3f}")
    print("(small init → small delta → residual path is doing its job)")
Run it:
uv run python -m tiny_llm.block
Expected output:
params: 49,408
 attn: 16,384
 mlp: 32,768
 ln_1+2: 256
input shape: (2, 12, 64)
output shape: (2, 12, 64)
shapes match: True
relative change ‖out − x‖ / ‖x‖: 0.080
(small init → small delta → residual path is doing its job)
What to notice:
- MLP is 2× the attention params (33k vs 16k). Same ratio at every model size with the 4× expansion. In production transformers the MLP is where most weights live.
- Layer norms are tiny — 256 params total for two layer norms (each is just γ and β of size d_model=64, so 4 × 64 = 256). Almost free.
- Output shape unchanged — we can stack as many of these as we want and the tensor shape stays (B, T, d_model).
- Small relative change at init. With weights initialized to std=0.02, the sub-layer outputs are small relative to the input. The residual path passes the input through nearly unchanged, which is exactly what we want at init: a randomly initialized transformer should be close to the identity, and training is what teaches the sub-layers to make meaningful changes. This is the property pre-norm preserves and post-norm breaks.
Cross-reference
The Transformer Block article walks through the same architecture from the math side. The Layer Norm demo shows what nn.LayerNorm does to a distribution interactively — slide the input distribution and watch center → normalize → rescale.
If you wrote LayerNorm from scratch, it’s:
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
nn.LayerNorm(d_model) does this for us, with learned γ (init: ones) and β (init: zeros). At init the layer norm is the identity — γ=1, β=0 means gamma * x_hat + beta = x_hat, the standardized input. Training nudges these to whatever the model needs.
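You can convince yourself the two agree with a quick check — this uses the layer_norm function above and nn.LayerNorm at its default init:

import torch
import torch.nn as nn

ln = nn.LayerNorm(64)   # γ init to ones, β to zeros
x = torch.randn(2, 12, 64)
print(torch.allclose(layer_norm(x, ln.weight, ln.bias), ln(x), atol=1e-6))   # True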
What we did and didn’t do
What we did:
- The standard pre-norm transformer block: x + attn(ln(x)), then x + mlp(ln(x))
- An MLP module with the conventional 4× expansion and GELU activation
- Layer norm pulled out of the residual path so gradients flow freely
- Confirmed shape preservation and small initial perturbation magnitude
What we didn’t:
- RMSNorm. LLaMA-style models replace LayerNorm with RMSNorm — it drops the mean subtraction (and the β shift) to save a little compute. Slightly faster, essentially identical training quality at our scale. We use the better-known LayerNorm for clarity; a sketch of RMSNorm follows this list.
- SwiGLU. LLaMA’s MLP is swiglu(x) = linear_down(silu(linear_gate(x)) · linear_up(x)), where silu(z) = z · sigmoid(z) — three linears instead of two, more parameters but better quality per parameter. We stick with the simpler GELU form.
- DropPath / Stochastic Depth. Some training recipes randomly skip blocks at training time. Not needed at our scale.
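For the curious, here is roughly what the RMSNorm swap looks like — a reference sketch, not something we use in tiny_llm:

import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, d_model: int, eps: float = 1e-6) -> None:
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))   # γ only, no β

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale by the root-mean-square; no mean subtraction, no shift.
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).sqrt()
        return self.weight * (x / rms)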
Next
Step 08 assembles the full GPT class — embedding (step 04), N of these blocks, a final layer norm, and an output projection to vocab size. Plus the trick of tying the input and output embeddings to save parameters and improve training. By the end of step 08 you can call model(token_ids) and get logits over the vocabulary.