tiny-llm 08 / 16 18 min read · 25 min hands-on

step 08 · build

Assemble the GPT class

Embeddings + N transformer blocks + final norm + LM head + tied weights. The whole model in 60 lines.

model

We have all the pieces. Step 04 built the embedding. Steps 05–06 built attention. Step 07 wrapped it into a transformer block. This step composes them into a complete GPT class: the thing you’ll actually train.

The full model is just:

  1. Embed token IDs to vectors
  2. Run through N transformer blocks
  3. Final layer norm
  4. LM head — project back to vocabulary size

Plus one parameter-saving trick we’ll actually implement: tying the token embedding and the output projection so they share the same weight matrix. It’s a small detail with a real impact on both quality and parameter count.

By the end you’ll have a working language model. You can call model(token_ids) and get back logits of shape (B, T, vocab_size) ready for nn.CrossEntropyLoss. You won’t have trained it yet — that’s step 09 — but every architectural piece will be in place.

Configuration object first

Before we write the model, let’s collect every hyperparameter into one place. Otherwise we’d be passing six arguments to every constructor:

# tiny_llm/gpt.py
from dataclasses import dataclass

@dataclass
class GPTConfig:
    """All hyperparameters that define a GPT instance.

    These are the architectural knobs — pick them once, freeze them,
    train. (Training-time knobs like learning rate live elsewhere.)
    """
    vocab_size: int = 4096           # set in step 03 by tokenizer
    max_seq_len: int = 256
    d_model: int = 192
    n_heads: int = 6
    n_layers: int = 6
    mlp_ratio: int = 4

The defaults give us a ~5M-param “tiny” model that fits on a CPU and trains in minutes. We’ll bump these in step 11 to scale up; for now this size lets us iterate fast.

A few words on each:

  • d_model = 192, n_heads = 6, n_layers = 6. d_head = 32, which is on the small side but works at this scale. A real GPT-2 small uses (768, 12, 12); we’ll get there in step 11.
  • max_seq_len = 256. The context window. Memory cost grows quadratically with this (the attention matrix is T × T), so we keep it small to start. The TinyStories average story is ~200 tokens with our vocab, so 256 is enough to fit most stories whole.
  • vocab_size = 4096. Set by the tokenizer in step 03. Has to match the tokenizer the data pipeline uses.
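If you want to experiment, override the defaults at construction time. A quick sketch (the values here are illustrative, not a recommended config):

config = GPTConfig(d_model=384, n_heads=12, n_layers=8)
assert config.d_model % config.n_heads == 0    # heads must divide d_model evenly
print(config.d_model // config.n_heads)        # 32, the per-head dimension d_head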

The class

# tiny_llm/gpt.py
import torch
import torch.nn as nn
import torch.nn.functional as F

from tiny_llm.embed import Embed
from tiny_llm.block import Block


class GPT(nn.Module):
    """A decoder-only transformer (the GPT architecture).

    Forward pass:
        token_ids: (B, T)
        →  Embed (token + position):    (B, T, d_model)
        →  N × Block:                    (B, T, d_model)
        →  Final LayerNorm:              (B, T, d_model)
        →  LM head (linear to vocab):   (B, T, vocab_size)

    Returns logits. Apply CrossEntropyLoss yourself; we don't compute
    softmax inside the model so the trainer can use the numerically
    stable combined loss.
    """

    def __init__(self, config: GPTConfig) -> None:
        super().__init__()
        self.config = config

        self.embed = Embed(
            vocab_size=config.vocab_size,
            d_model=config.d_model,
            max_seq_len=config.max_seq_len,
        )

        self.blocks = nn.ModuleList([
            Block(
                d_model=config.d_model,
                n_heads=config.n_heads,
                max_seq_len=config.max_seq_len,
                mlp_ratio=config.mlp_ratio,
            )
            for _ in range(config.n_layers)
        ])

        # Final layer norm before the LM head. This is on the residual
        # stream and is part of the standard GPT recipe.
        self.ln_f = nn.LayerNorm(config.d_model)

        # LM head: project from d_model to vocab. No bias (the standard).
        self.lm_head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # WEIGHT TYING — explanation below.
        self.lm_head.weight = self.embed.tok_emb.weight

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        x = self.embed(token_ids)                # (B, T, d_model)
        for block in self.blocks:
            x = block(x)                          # (B, T, d_model)
        x = self.ln_f(x)                          # (B, T, d_model)
        logits = self.lm_head(x)                  # (B, T, vocab_size)
        return logits

    @property
    def n_params(self) -> int:
        """Total trainable parameters.

        The LM head weight is tied to the token embedding, and PyTorch's
        `parameters()` yields each nn.Parameter once, so the shared tensor
        is counted exactly once here; no manual subtraction is needed.
        """
        return sum(p.numel() for p in self.parameters())

Let me unpack the two non-obvious design decisions.

Weight tying

self.lm_head.weight = self.embed.tok_emb.weight

This single line does something subtle. The token embedding weight has shape (vocab_size, d_model), and so does the LM head weight: nn.Linear(d_model, vocab_size) stores its weight as (out_features, in_features) and computes x @ W.T, so no reshaping is needed. The assignment makes them the same tensor: same memory, same gradients, same updates.

Three reasons every modern language model does this:

  1. Saves vocab_size · d_model parameters. With a 4096 vocab and d_model=192, that’s ~786k parameters saved per model. For GPT-2 small (50k vocab, 768 d_model), it’s 38M parameters — about 30% of the model.
  2. Improves training quality at the same parameter count. Empirically, tied embeddings reach lower loss faster than untied. The intuition: the input and output should embed words into the same space; using one matrix for both forces consistency.
  3. Inherent regularization. The shared matrix has to be useful in two roles, which is harder than being useful in either alone.

Weight tying has been standard practice since around 2017. GPT-2 ties its embeddings, and most small models do the same; some larger models (LLaMA, Mistral) keep the output projection untied, because the embedding is a much smaller fraction of their parameter budget. At our scale the savings matter, so we tie them.

In PyTorch, the assignment makes both attributes point at the same nn.Parameter. When optimizer.step() updates tok_emb.weight, it’s also updating lm_head.weight — they’re literally the same memory.
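If you want to see the mechanics in isolation, here is a tiny standalone sketch (generic nn.Embedding and nn.Linear, not our classes) showing both roles hitting the same tensor:

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                  # toy vocab of 10, d_model of 4
head = nn.Linear(4, 10, bias=False)
head.weight = emb.weight                   # tie: one nn.Parameter, two roles

logits = head(emb(torch.tensor([1, 2, 3])))
logits.sum().backward()
print(head.weight is emb.weight)           # True: same object in memory
print(emb.weight.grad.shape)               # torch.Size([10, 4]): one shared gradient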

Don’t compute the loss inside forward

Some implementations have forward take an optional targets argument and return (logits, loss). We don’t, for two reasons:

  1. Single responsibility. The model produces logits. The trainer applies the loss. That separation makes generation (step 10) and evaluation (step 13) simpler — they don’t need to pass a phony target.
  2. Easy to add training-time tricks later. Things like label smoothing, masked-language-modeling losses, or DPO objectives all want different loss formulations. Keeping forward purely a logits-producer keeps the model reusable.

The training loop in step 09 will compute the loss explicitly:

logits = model(input_ids)                # (B, T, vocab_size)
loss = F.cross_entropy(
    logits.view(-1, vocab_size),         # flatten to (B*T, vocab_size)
    targets.view(-1)                     # flatten to (B*T,)
)

That’s the whole training-time loss computation. Two lines.
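For context, the targets in that snippet are just the inputs shifted one position to the left: at position t the model is asked to predict token t+1. A sketch of how a batch might be sliced (token_ids_from_dataset is a hypothetical name; the real data pipeline arrives in step 09):

chunk = token_ids_from_dataset             # (B, T+1) batch of token IDs
input_ids = chunk[:, :-1]                  # (B, T)  what the model sees
targets   = chunk[:, 1:]                   # (B, T)  the next token at each position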

Generation helper

We’ll write a real sampler in step 10, but a simple greedy helper is useful for sanity checks now:

# tiny_llm/gpt.py (continuing the class)
    @torch.no_grad()
    def generate(self, prompt_ids: torch.Tensor, max_new_tokens: int = 50) -> torch.Tensor:
        """Greedy generation. Picks the argmax token at each step.

        prompt_ids: (B, T_prompt) — starting tokens
        returns:     (B, T_prompt + max_new_tokens)
        """
        self.eval()
        ids = prompt_ids
        for _ in range(max_new_tokens):
            # Truncate to context window if we're near the limit.
            ids_cond = ids if ids.size(1) <= self.config.max_seq_len else ids[:, -self.config.max_seq_len:]
            logits = self(ids_cond)             # (B, T, vocab_size)
            next_logits = logits[:, -1, :]      # (B, vocab_size) — last position only
            next_id = next_logits.argmax(dim=-1, keepdim=True)  # (B, 1)
            ids = torch.cat([ids, next_id], dim=1)
        return ids

@torch.no_grad() because we don’t need gradients during generation; skipping them means PyTorch doesn’t build the autograd graph or hold onto intermediate activations, which cuts forward-pass memory substantially. self.eval() switches any stochastic modules (none in our model, but it’s a good habit) to evaluation mode.

Greedy generation always picks the highest-probability next token. It produces bland-but-coherent output and is fully deterministic given the prompt. We’ll add temperature, top-k, and top-p in step 10.

Sanity check

Add at the bottom:

# tiny_llm/gpt.py
if __name__ == "__main__":
    torch.manual_seed(0)

    config = GPTConfig()  # all defaults
    model = GPT(config)

    print(f"config: {config}")
    print(f"\nparameter breakdown:")
    print(f"  embed:    {sum(p.numel() for p in model.embed.parameters()):,}")
    for i, b in enumerate(model.blocks):
        print(f"  block {i}:  {sum(p.numel() for p in b.parameters()):,}")
    print(f"  ln_f:     {sum(p.numel() for p in model.ln_f.parameters()):,}")
    print(f"  lm_head:  {sum(p.numel() for p in model.lm_head.parameters()):,}  (tied to embed)")
    print(f"  TOTAL:    {model.n_params:,}")

    # Forward pass on a batch.
    input_ids = torch.randint(0, config.vocab_size, (2, 32))
    logits = model(input_ids)
    print(f"\ninput shape:  {tuple(input_ids.shape)}")
    print(f"logits shape: {tuple(logits.shape)}")

    # Verify weight tying actually shares memory.
    tied = model.lm_head.weight.data_ptr() == model.embed.tok_emb.weight.data_ptr()
    print(f"\nweight tying active: {tied}")

    # Untrained generation — should produce nonsense, but valid IDs.
    prompt = torch.tensor([[0, 1, 2]])  # arbitrary 3 tokens
    out = model.generate(prompt, max_new_tokens=8)
    print(f"\ngreedy generation (untrained, expect nonsense):")
    print(f"  prompt: {prompt[0].tolist()}")
    print(f"  full:   {out[0].tolist()}")

Run it:

uv run python -m tiny_llm.gpt

Expected output:

config: GPTConfig(vocab_size=4096, max_seq_len=256, d_model=192, n_heads=6, n_layers=6, mlp_ratio=4)

parameter breakdown:
  embed:    835,584
  block 0:  738,432
  block 1:  738,432
  block 2:  738,432
  block 3:  738,432
  block 4:  738,432
  block 5:  738,432
  ln_f:     384
  lm_head:  786,432  (tied to embed)
  TOTAL:    5,266,560

input shape:  (2, 32)
logits shape: (2, 32, 4096)

weight tying active: True

greedy generation (untrained, expect nonsense):
  prompt: [0, 1, 2]
  full:   [0, 1, 2, 1234, 1234, 1234, 1234, 1234, 1234, 1234, 1234]

What to notice:

  • 5.3M parameters total. For comparison, GPT-2 small is 124M, GPT-2 large is 774M, and GPT-3 is 175B. We’re tiny, but every architectural piece is here. Step 11 walks through the scaling knobs.
  • Six blocks at ~738k params each, about 84% of the total. Consistent with the rule we established in step 07: most parameters live in the blocks (especially the MLP).
  • lm_head listed as 786k but the total is correct. PyTorch’s parameters() iterator deduplicates tied parameters — it returns each nn.Parameter once. So model.n_params doesn’t double-count.
  • Generation output: same token repeated. A randomly initialized model has roughly uniform output probabilities; whichever token has the highest random logit gets picked, then keeps getting picked because the context barely shifts the distribution. After training in step 09 this becomes coherent.

Cross-references

  • The Inference Pipeline demo shows the full forward pass we just wrote, but for a real GPT-2 model with 12 blocks of 12 heads each. Open it and step through — every stage in the demo maps to a line of our GPT.forward.
  • The Transformer Block article is the deeper theoretical companion to this implementation.

What we did and didn’t do

What we did:

  • Composed the full GPT architecture from prior pieces
  • A GPTConfig dataclass for clean hyperparameter management
  • Tied input/output embeddings — saves ~786k params on this config
  • Greedy generation helper for quick sanity checks
  • Confirmed end-to-end that (B, T) int tokens become (B, T, vocab_size) logits

What we didn’t:

  • Custom weight init beyond what we did per-component. We initialized embeddings and Linear weights with std=0.02 in their respective files; the nn.LayerNorm defaults (γ=1, β=0) are fine. Some modern recipes (Karpathy’s nanoGPT) additionally scale the init of the residual projections by 1/√(2·n_layers); see the sketch after this list. Not needed at our scale; revisit if training diverges.
  • Activation checkpointing. A memory-saving training trick that recomputes activations in the backward pass. Useful for big models, irrelevant for our 5M-param model.
  • Proper docstrings on every method. I’ve kept comments terse to match the writing style; in your real repo, beef them up.
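If you ever do want that residual-projection scaling, here is a minimal sketch. It assumes each Block exposes its attention and MLP output projections as attn.proj and mlp.proj; those attribute names are hypothetical, so adapt them to whatever step 07 actually called the layers.

import math
import torch.nn as nn

def scale_residual_projections(model: GPT) -> None:
    """nanoGPT-style tweak: shrink the init std of the residual projections
    by 1/sqrt(2 * n_layers) so the residual stream's variance stays roughly
    constant as depth grows. Attribute names below are illustrative."""
    std = 0.02 / math.sqrt(2 * model.config.n_layers)
    for block in model.blocks:
        nn.init.normal_(block.attn.proj.weight, mean=0.0, std=std)   # attention output projection
        nn.init.normal_(block.mlp.proj.weight, mean=0.0, std=std)    # MLP down-projection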

Next

Step 09 takes this model and trains it. AdamW optimizer, learning-rate schedule with warmup + cosine decay, gradient clipping, periodic eval, checkpoint saving. About 120 lines of train.py that produces a model that can complete sentences from TinyStories.