GPT From Scratch
A complete, runnable, decoder-only transformer language model in ~250 lines of PyTorch. Train on TinyShakespeare; generate plausibly-Shakespearean nonsense; understand every line.
This is heavily inspired by Andrej Karpathy’s nanoGPT — go run nanoGPT at least once; it’s the best learning resource there is.
The full model
import math
import torch
import torch.nn as nn
import torch.nn.functional as F
class CausalSelfAttention(nn.Module):
def __init__(self, d_model, num_heads, max_seq_len):
super().__init__()
assert d_model % num_heads == 0
self.num_heads = num_heads
self.d_k = d_model // num_heads
self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
self.proj = nn.Linear(d_model, d_model, bias=False)
        # Causal mask buffer (the fused SDPA call below applies causality itself; kept as an explicit reference)
self.register_buffer(
"mask",
torch.triu(torch.ones(max_seq_len, max_seq_len), diagonal=1).bool(),
)
def forward(self, x):
B, T, C = x.shape
qkv = self.qkv(x)
q, k, v = qkv.split(C, dim=-1)
q = q.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
k = k.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
v = v.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        # Fused scaled-dot-product attention (PyTorch 2.0+); is_causal=True masks future positions
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
out = out.transpose(1, 2).contiguous().view(B, T, C)
return self.proj(out)
class MLP(nn.Module):
def __init__(self, d_model, mlp_dim):
super().__init__()
self.fc1 = nn.Linear(d_model, mlp_dim)
self.fc2 = nn.Linear(mlp_dim, d_model)
def forward(self, x):
return self.fc2(F.gelu(self.fc1(x)))
class Block(nn.Module):
def __init__(self, d_model, num_heads, mlp_dim, max_seq_len):
super().__init__()
self.norm1 = nn.LayerNorm(d_model)
self.attn = CausalSelfAttention(d_model, num_heads, max_seq_len)
self.norm2 = nn.LayerNorm(d_model)
self.mlp = MLP(d_model, mlp_dim)
def forward(self, x):
x = x + self.attn(self.norm1(x))
x = x + self.mlp(self.norm2(x))
return x
class GPT(nn.Module):
def __init__(self, vocab_size, d_model=384, num_heads=6, mlp_dim=1536,
num_layers=6, max_seq_len=256):
super().__init__()
self.max_seq_len = max_seq_len
self.tok_embed = nn.Embedding(vocab_size, d_model)
self.pos_embed = nn.Embedding(max_seq_len, d_model)
self.blocks = nn.ModuleList([
Block(d_model, num_heads, mlp_dim, max_seq_len)
for _ in range(num_layers)
])
self.norm = nn.LayerNorm(d_model)
self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
# Weight tying
self.lm_head.weight = self.tok_embed.weight
def forward(self, idx, targets=None):
B, T = idx.shape
pos = torch.arange(T, device=idx.device)
x = self.tok_embed(idx) + self.pos_embed(pos)
for block in self.blocks:
x = block(x)
x = self.norm(x)
logits = self.lm_head(x)
if targets is None:
return logits, None
loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
return logits, loss
@torch.no_grad()
def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
for _ in range(max_new_tokens):
idx_cond = idx[:, -self.max_seq_len:]
logits, _ = self(idx_cond)
logits = logits[:, -1, :] / temperature
if top_k is not None:
v, _ = torch.topk(logits, top_k)
logits[logits < v[:, [-1]]] = -float("inf")
probs = F.softmax(logits, dim=-1)
next_tok = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, next_tok), dim=1)
return idx
That’s a complete decoder-only GPT. Let’s train it.
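Before wiring up the data, a quick smoke test catches shape bugs early. This is a sketch that assumes a 65-character vocabulary (TinyShakespeare’s, built in the next section); the exact parameter count shifts slightly with vocab size.

vocab_size = 65  # assumption: TinyShakespeare's character vocab, derived below
model = GPT(vocab_size)
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")  # ~10.8M (lm_head is tied, so counted once)
dummy = torch.randint(0, vocab_size, (2, 32))   # batch of 2 sequences, 32 tokens each
logits, loss = model(dummy, targets=dummy)
print(logits.shape)   # torch.Size([2, 32, 65])
print(loss.item())    # roughly ln(65) ≈ 4.2 at initialization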
Data: character-level TinyShakespeare
import requests
data = requests.get(
"https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
).text
chars = sorted(set(data))
vocab_size = len(chars)
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for c, i in stoi.items()}
encoded = torch.tensor([stoi[c] for c in data], dtype=torch.long)
n = int(0.9 * len(encoded))
train, val = encoded[:n], encoded[n:]
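Two small helpers make the character mapping easy to round-trip; they are conveniences added here, not required by the rest of the code.

def encode(s):
    return [stoi[c] for c in s]            # string -> list of token ids

def decode(ids):
    return "".join(itos[i] for i in ids)   # list of token ids -> string

assert decode(encode("To be, or not to be")) == "To be, or not to be"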
Training loop
import torch
torch.manual_seed(0)
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 64
max_seq_len = 256
model = GPT(vocab_size, max_seq_len=max_seq_len).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
def get_batch(split):
data = train if split == "train" else val
ix = torch.randint(len(data) - max_seq_len - 1, (batch_size,))
x = torch.stack([data[i:i + max_seq_len] for i in ix])
y = torch.stack([data[i + 1:i + 1 + max_seq_len] for i in ix])
return x.to(device), y.to(device)
for step in range(5000):
x, y = get_batch("train")
_, loss = model(x, y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
if step % 500 == 0:
with torch.no_grad():
xv, yv = get_batch("val")
_, vloss = model(xv, yv)
print(f"step {step}: train {loss.item():.3f} val {vloss.item():.3f}")
After ~5000 steps on a single GPU, you’ll see val loss drop from about 4.2 at initialization (≈ ln 65, uniform guessing over the character vocabulary) to around 1.5. Generate:
ctx = torch.zeros((1, 1), dtype=torch.long, device=device)
out = model.generate(ctx, max_new_tokens=500, temperature=0.8, top_k=40)
print("".join(itos[int(i)] for i in out[0]))
You’ll get gibberish that looks structurally like Shakespeare — fake-archaic words, character names, dialogue formatting. The model learned the structure of the text.
What you’ve built
- A character-level transformer language model.
- ~10M parameters, trains in minutes on a single GPU.
- Demonstrates: tokenization, embedding lookup, multi-head attention, position embeddings, residual connections, normalization, MLP blocks, autoregressive sampling.
Scaling up
To turn this toy into something real:
- Real tokenizer. Use BPE (tiktoken or tokenizers) instead of characters.
- Bigger data. OpenWebText, FineWeb, anything in the hundreds-of-GB range.
- More parameters. Increase d_model, num_layers. Watch FLOPs.
- RoPE instead of learned positional embeddings.
- GQA instead of vanilla multi-head.
- Mixed precision (bf16) for 2× speedup.
- Distributed training (FSDP) once you outgrow one GPU.
- Longer training, with warmup + cosine LR (a minimal schedule sketch follows this list).
- Eval beyond loss: HellaSwag, MMLU, Lambada — see Stage 13.
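As a concrete sketch of the warmup-plus-cosine item: the scheduler below warms the learning rate up linearly, then decays it with a cosine down to 10% of the peak. The step counts are illustrative, not tuned.

import math
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps = 200, 5000

def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay down to 10% of it.
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.1 + 0.45 * (1 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() right after optimizer.step().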
By the time you’ve done all this, you’ve reproduced something like a small open-weights model.
Things to try
- Train at multiple sizes. 2-layer, 6-layer, 12-layer. Plot val loss vs parameters. This is your first scaling-laws plot.
- Add KV caching to generate(). Time the speedup at long generations.
- Replace LayerNorm with RMSNorm and verify quality is unchanged (a minimal RMSNorm sketch follows this list).
- Replace learned pos embeddings with RoPE.
- Turn weight tying off and see how much the parameter count grows.
- Generate with different temperatures and top_p. Notice sampling effects.
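For the RMSNorm experiment above, a minimal drop-in replacement for nn.LayerNorm looks like this (the eps value mirrors common practice; treat it as a starting point):

class RMSNorm(nn.Module):
    # Scale by the root-mean-square of the features; no mean subtraction, no bias.
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms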
Honest reality check
A model that hits val loss 1.5 on TinyShakespeare is infinitely less capable than a real LLM. It can’t:
- Answer questions
- Hold a conversation
- Write code
- Reason
To get those, you need:
- 100,000× more data
- 100–1000× more parameters
- Pretraining + SFT + preference tuning
But mechanically, this toy is the same architecture as Claude or GPT or LLaMA. Everything you build from here is “the same, but more.”
Watch it interactively
- Pipeline — the same architecture (12 layers, decoder-only) running on real GPT-2 small. Watch a prompt flow through the stack: tokenizer → embeddings → 12 layers → logits → sampling.
- Sampling Knobs — Predict before clicking: at temperature 0 the model is deterministic; at T=2 it’s incoherent. The same logits, three knobs (T, top-p, top-k), four very different outputs.
- Beam Search Lab — decoding strategies on real GPT-2 small logits. Watch greedy vs beam vs sampling pick differently from the same distribution.
Build it in code
- /build/08 — wire up GPT — the full curriculum’s payoff: stack the transformer block N times, train on TinyShakespeare, watch perplexity drop. ~300 lines including training loop.
- /build/12 — fine-tune your tiny GPT — once you have a working GPT, LoRA-fine-tune it on a new corpus. ~50 lines on top.