GPT From Scratch

A complete, runnable, decoder-only transformer language model in ~250 lines of PyTorch. Train on TinyShakespeare; generate plausibly Shakespearean nonsense; understand every line.

This is heavily inspired by Andrej Karpathy’s nanoGPT — go run nanoGPT once; it’s the best learning resource there is.

The full model

import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads, max_seq_len):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads

        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.proj = nn.Linear(d_model, d_model, bias=False)

        # Causal mask (used by the manual attention fallback in forward)
        self.register_buffer(
            "mask",
            torch.triu(torch.ones(max_seq_len, max_seq_len), diagonal=1).bool(),
        )

    def forward(self, x):
        B, T, C = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.split(C, dim=-1)
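        # reshape each of q, k, v from (B, T, C) to (B, num_heads, T, d_k)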
        q = q.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        k = k.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        v = v.view(B, T, self.num_heads, self.d_k).transpose(1, 2)

        # Use the fused SDPA kernel when available (PyTorch >= 2.0);
        # otherwise fall back to manual masked attention.
        if hasattr(F, "scaled_dot_product_attention"):
            out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        else:
            att = (q @ k.transpose(-2, -1)) / math.sqrt(self.d_k)
            att = att.masked_fill(self.mask[:T, :T], float("-inf"))
            att = F.softmax(att, dim=-1)
            out = att @ v

        out = out.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(out)


class MLP(nn.Module):
    def __init__(self, d_model, mlp_dim):
        super().__init__()
        self.fc1 = nn.Linear(d_model, mlp_dim)
        self.fc2 = nn.Linear(mlp_dim, d_model)

    def forward(self, x):
        return self.fc2(F.gelu(self.fc1(x)))


class Block(nn.Module):
    def __init__(self, d_model, num_heads, mlp_dim, max_seq_len):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, num_heads, max_seq_len)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = MLP(d_model, mlp_dim)

    def forward(self, x):
        x = x + self.attn(self.norm1(x))
        x = x + self.mlp(self.norm2(x))
        return x


class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=384, num_heads=6, mlp_dim=1536,
                 num_layers=6, max_seq_len=256):
        super().__init__()
        self.max_seq_len = max_seq_len
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_seq_len, d_model)
        self.blocks = nn.ModuleList([
            Block(d_model, num_heads, mlp_dim, max_seq_len)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        # Weight tying
        self.lm_head.weight = self.tok_embed.weight

    def forward(self, idx, targets=None):
        B, T = idx.shape
        pos = torch.arange(T, device=idx.device)
        x = self.tok_embed(idx) + self.pos_embed(pos)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.lm_head(x)

        if targets is None:
            return logits, None

        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        for _ in range(max_new_tokens):
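            # crop the context to the last max_seq_len tokens (the position table is finite)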
            idx_cond = idx[:, -self.max_seq_len:]
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature
            if top_k is not None:
                v, _ = torch.topk(logits, top_k)
                logits[logits < v[:, [-1]]] = -float("inf")
            probs = F.softmax(logits, dim=-1)
            next_tok = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, next_tok), dim=1)
        return idx
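
A quick smoke test (a minimal sketch; the vocab size of 65 matches the character-level data in the next section):

model = GPT(vocab_size=65)
x = torch.randint(0, 65, (2, 32))  # batch of 2 sequences, 32 tokens each
logits, loss = model(x, targets=x)
print(logits.shape)  # torch.Size([2, 32, 65])
print(loss.item())   # ~ln(65) ≈ 4.17 at random init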

That’s a complete decoder-only GPT. Let’s train it.

Data: character-level TinyShakespeare

import requests

data = requests.get(
    "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
).text

chars = sorted(set(data))
vocab_size = len(chars)
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}

encoded = torch.tensor([stoi[c] for c in data], dtype=torch.long)
n = int(0.9 * len(encoded))
train, val = encoded[:n], encoded[n:]
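
Two quick sanity checks (the character vocabulary of this dataset comes out to 65):

print(vocab_size)  # 65
assert "".join(itos[int(i)] for i in encoded[:25]) == data[:25]  # round-trip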

Training loop

import torch
torch.manual_seed(0)

device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
max_seq_len = 256
model = GPT(vocab_size, max_seq_len=max_seq_len).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)


def get_batch(split):
    data = train if split == "train" else val
    ix = torch.randint(len(data) - max_seq_len - 1, (batch_size,))
    x = torch.stack([data[i:i + max_seq_len] for i in ix])
    y = torch.stack([data[i + 1:i + 1 + max_seq_len] for i in ix])
    return x.to(device), y.to(device)


for step in range(5000):
    x, y = get_batch("train")
    _, loss = model(x, y)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    if step % 500 == 0:
        with torch.no_grad():
            xv, yv = get_batch("val")
            _, vloss = model(xv, yv)
        print(f"step {step}: train {loss.item():.3f} val {vloss.item():.3f}")

After ~5000 steps on a single GPU, you’ll see val loss drop from ~4.2 (essentially ln 65, the uniform-guess baseline) to ~1.5. Generate:

ctx = torch.zeros((1, 1), dtype=torch.long, device=device)
out = model.generate(ctx, max_new_tokens=500, temperature=0.8, top_k=40)
print("".join(itos[int(i)] for i in out[0]))

You’ll get gibberish that looks structurally like Shakespeare — fake-archaic words, character names, dialogue formatting. The model learned the structure of the text.

What you’ve built

  • A character-level transformer language model.
  • ~10M parameters, trains in minutes on a single GPU (count them with the snippet after this list).
  • Demonstrates: tokenization, embedding lookup, multi-head attention, position embeddings, residual connections, normalization, MLP blocks, autoregressive sampling.
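
You can verify the count directly (the tied lm_head/tok_embed weight is one tensor, so parameters() counts it once):

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")  # ≈ 10.8M with the defaults above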

Scaling up

To turn this toy into something real:

  1. Real tokenizer. Use BPE (tiktoken or tokenizers) instead of characters.
  2. Bigger data. OpenWebText, FineWeb, anything in the hundreds-of-GB range.
  3. More parameters. Increase d_model, num_layers. Watch FLOPs.
  4. RoPE instead of learned positional embeddings (sketched after this list).
  5. GQA instead of vanilla multi-head.
  6. Mixed precision (bf16) for a roughly 2× speedup.
  7. Distributed training (FSDP) once you outgrow one GPU.
  8. Longer training, with warmup + cosine LR (a schedule sketch follows this list).
  9. Eval beyond loss: HellaSwag, MMLU, Lambada — see Stage 13.
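
Two of these are concrete enough to sketch here. For RoPE (item 4), instead of adding a learned position vector to each token embedding, you rotate every (q, k) channel pair inside attention by a position-dependent angle. A minimal sketch (rope_cache and apply_rope are illustrative helpers, not a library API):

def rope_cache(seq_len, d_k, base=10000.0, device=None):
    # one frequency per channel pair, as in the RoPE paper
    inv_freq = 1.0 / (base ** (torch.arange(0, d_k, 2, device=device).float() / d_k))
    t = torch.arange(seq_len, device=device).float()
    freqs = torch.outer(t, inv_freq)  # (T, d_k/2)
    return freqs.cos(), freqs.sin()

def apply_rope(x, cos, sin):
    # x: (B, num_heads, T, d_k); rotate consecutive channel pairs
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return out.flatten(-2)

Apply it to q and k right after the head reshape and delete pos_embed from GPT. For warmup + cosine decay (item 8), a minimal schedule with illustrative hyperparameters:

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup=100, total=5000):
    if step < warmup:
        return max_lr * (step + 1) / warmup  # linear warmup
    progress = (step - warmup) / (total - warmup)  # 0 to 1 over the decay phase
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# set it each step, before optimizer.step():
# for g in optimizer.param_groups:
#     g["lr"] = lr_at(step)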

By the time you’ve done all this, you’ve reproduced something like a small open-weights model.

Things to try

  1. Train at multiple sizes. 2-layer, 6-layer, 12-layer. Plot val loss vs parameters. This is your first scaling-laws plot.
  2. Add KV caching to generate(). Time the speedup at long generations.
  3. Replace LayerNorm with RMSNorm (sketched after this list). Verify quality is unchanged.
  4. Replace learned pos embeddings with RoPE.
  5. Turn weight tying off and see how much the parameter count grows.
  6. Generate with different temperatures and with top-p (nucleus sampling, sketched after this list). Notice how the outputs change.
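
Sketches for two of these. RMSNorm (item 3) keeps LayerNorm’s learned scale but drops the mean subtraction and bias:

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        # scale each vector by the reciprocal of its root-mean-square
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

For item 6, generate() only implements top-k; a nucleus (top-p) helper that keeps the smallest set of tokens whose cumulative probability exceeds p could look like this (top_p_filter is an illustrative name):

def top_p_filter(logits, p=0.9):
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    remove = cum_probs > p
    remove[..., 1:] = remove[..., :-1].clone()  # shift: the token that crosses p survives
    remove[..., 0] = False                      # always keep the most likely token
    sorted_logits = sorted_logits.masked_fill(remove, -float("inf"))
    return logits.scatter(-1, sorted_idx, sorted_logits)  # unsort back to vocab order

Swap it in for the top_k branch in generate(): logits = top_p_filter(logits, p=0.9).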

Honest reality check

A model that hits val loss 1.5 on TinyShakespeare is infinitely less capable than a real LLM. It can’t:

  • Answer questions
  • Hold a conversation
  • Write code
  • Reason

To get those, you need:

  • 100,000× more data
  • 100–1000× more parameters
  • Pretraining + SFT + preference tuning

But mechanically, this toy is the same architecture as Claude or GPT or LLaMA. Everything you build from here is “the same, but more.”

Watch it interactively

  • Pipeline — the same architecture (12 layers, decoder-only) running on real GPT-2 small. Watch a prompt flow through the stack: tokenizer → embeddings → 12 layers → logits → sampling.
  • Sampling Knobs — Predict before clicking: at temperature 0 the model is deterministic; at T=2 it’s incoherent. The same logits, three knobs (T, top-p, top-k), four very different outputs.
  • Beam Search Lab — decoding strategies on real GPT-2 small logits. Watch greedy vs beam vs sampling pick differently from the same distribution.
