tiny-llm 01 / 16 · 12 min read

step 01 · build

The math you actually need

Five ideas — vectors, matmul, softmax, cross-entropy, gradients — and where each one shows up in our code.

setup

You’ve probably seen most of this before. This article isn’t trying to teach linear algebra or backprop from zero — it’s lining up the working vocabulary you’ll need to read the next 15 steps without stopping to look things up. Five concepts, in the order they’ll appear in our code.

If any of them is genuinely rusty, the theory articles on this site cover each in depth, and I’ll link the relevant one as we go.

1. Vectors and matrices

A vector is an ordered list of numbers. In this curriculum, every “thing” we deal with is a vector at some point — a token’s embedding is a vector, the hidden state inside the transformer is a vector, the output logits over the vocabulary are a vector.

A matrix is a 2D array. PyTorch generalizes both into the tensor — an N-dimensional array — and that’s the only data structure we’ll really use.

import torch

x = torch.tensor([1.0, 2.0, 3.0])      # shape (3,)        — a vector
W = torch.eye(4)                        # shape (4, 4)      — a matrix
batch = torch.randn(8, 64, 128)         # shape (8, 64, 128)

The most common shape in our model code is (batch, seq_len, d_model): batch independent sequences, each of seq_len tokens, each token represented by a d_model-dimensional vector.
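
To make that concrete, here is a throwaway example (the sizes are made up for illustration, not our model’s real hyperparameters); indexing one position of one sequence gives back a plain d_model-dimensional vector:

B, T, D = 8, 64, 128                  # batch, seq_len, d_model (illustrative sizes)
hidden = torch.randn(B, T, D)         # stand-in for an activation inside the model

token_vec = hidden[0, 5]              # sequence 0, position 5
token_vec.shape                       # torch.Size([128]), a single d_model vector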

If geometric intuition feels rusty (vectors as arrows, dot products as projections), the Linear Algebra Lab makes it concrete in 2D — drag two vectors, watch the dot product update.

2. Matrix multiplication is what neural networks ARE

Strip away the activation functions and a neural network is just a stack of matrix multiplications. The @ operator does the multiply in PyTorch:

x = torch.randn(8)             # input vector, shape (8,)
W = torch.randn(8, 16)         # weight matrix
y = x @ W                      # output, shape (16,) — projected from 8-d to 16-d

For PyTorch’s nn.Linear, the weight is stored as (out_features, in_features) and the call computes x @ W.T + b for you. What you need to remember:

  • A Linear layer projects from one dimension to another
  • Input shape (..., in_dim) becomes output shape (..., out_dim)
  • The (...) placeholder means “any number of leading batch dimensions” — PyTorch broadcasts automatically
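
If you want to see the x @ W.T + b equivalence rather than take it on faith, a quick check with toy sizes (the variable names here are just for the example):

x = torch.randn(8)
lin = torch.nn.Linear(8, 16)

y1 = lin(x)                           # what the layer computes
y2 = x @ lin.weight.T + lin.bias      # the same thing written out by hand
torch.allclose(y1, y2)                # True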

In step 05, the line q = self.W_q(x) is one of these. x has shape (B, T, D), the layer is nn.Linear(D, D), the result q has shape (B, T, D). Same shape, but every vector now lives in a different (learned) coordinate system — the “query” space.
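
A minimal sketch of what that call does, with placeholder names and sizes rather than the real step-05 module:

B, T, D = 2, 4, 16                    # illustrative sizes
x = torch.randn(B, T, D)

W_q = torch.nn.Linear(D, D)           # learned projection into "query" space
q = W_q(x)                            # applied to every token vector independently
q.shape                               # torch.Size([2, 4, 16]): same shape, new basis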

There’s one more matmul you’ll see constantly: batched matmul with @ between rank-3 tensors:

A = torch.randn(2, 4, 8)        # shape (B, T, D)
B = torch.randn(2, 8, 4)        # shape (B, D, T)
C = A @ B                       # shape (B, T, T) — multiplies the inner two dims

That’s exactly what q @ k.transpose(-2, -1) does inside attention in step 05.
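
Shape-wise that looks like this, with random tensors standing in for real queries and keys:

B, T, D = 2, 5, 8
q = torch.randn(B, T, D)              # queries
k = torch.randn(B, T, D)              # keys

scores = q @ k.transpose(-2, -1)      # (B, T, D) @ (B, D, T)
scores.shape                          # torch.Size([2, 5, 5]): one score per token pair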

3. Softmax

softmax takes a vector of unbounded real numbers and turns it into a probability distribution. The formula:

softmax(x)[i] = exp(x[i]) / Σⱼ exp(x[j])

After softmax, every entry is in [0, 1] and they sum to 1. We use it in two places:

  1. Inside attention: turn raw query·key scores into “how much I attend to each previous token”
  2. At the model’s output: turn logits over the vocabulary into “probability of each next token”

scores = torch.tensor([2.0, 1.0, 0.1])
probs = torch.softmax(scores, dim=-1)    # tensor([0.659, 0.242, 0.099])
probs.sum()                              # tensor(1.0)

The temperature knob (step 10): divide scores by some T > 1 before softmax to make the distribution flatter, or T < 1 to sharpen it. The Sampling Knobs demo shows this on real GPT-2 logits — drag temperature and watch the distribution reshape.
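
A quick sketch of the knob, re-using the toy scores from above (step 10 applies it to real logits):

scores = torch.tensor([2.0, 1.0, 0.1])

torch.softmax(scores / 2.0, dim=-1)   # T = 2: flatter, closer to uniform
torch.softmax(scores / 0.5, dim=-1)   # T = 0.5: sharper, more peaked
torch.softmax(scores / 1.0, dim=-1)   # T = 1: the plain softmax from above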

4. Cross-entropy

Cross-entropy is the loss function we’ll minimize. For one training example it asks: how surprised was the model that the right token was actually right?

H(y_true, y_pred) = − Σᵢ y_true[i] · log(y_pred[i])

For language modeling, y_true is one-hot — exactly one token is the correct answer — so this collapses to:

loss = − log(probability the model assigned to the correct token)

Confident and right? Small loss. Confident and wrong? Big loss. Distribution close to uniform? Loss roughly equal to log(vocab_size).

PyTorch’s nn.CrossEntropyLoss combines softmax + cross-entropy in a numerically stable way. We use it directly on the model’s pre-softmax logits — never call softmax during training:

loss_fn = torch.nn.CrossEntropyLoss()
logits = model(input_ids)                       # (B, T, vocab_size), pre-softmax
loss = loss_fn(
    logits.view(-1, logits.size(-1)),           # flatten to (B*T, vocab_size)
    targets.view(-1),                           # flatten to (B*T,)
)
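
If you want to convince yourself that the collapse to − log(probability of the correct token) is exactly what nn.CrossEntropyLoss computes, here is a tiny sanity check on made-up logits:

logits = torch.tensor([[2.0, 0.5, -1.0]])        # one example, 3-token "vocabulary"
target = torch.tensor([0])                       # the correct token is index 0

torch.nn.CrossEntropyLoss()(logits, target)      # tensor(0.2413)
-torch.log(torch.softmax(logits, dim=-1)[0, 0])  # tensor(0.2413), the same number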

If entropy / cross-entropy / KL still feel fuzzy, the Entropy & KL Divergence demo makes all three tangible — drag two distributions and watch each metric respond.

5. Gradients (one paragraph)

PyTorch’s autograd builds a computational graph as you do tensor ops, then loss.backward() walks the graph backward applying the chain rule, populating each parameter’s .grad attribute. The optimizer (we’ll use AdamW) reads those gradients and nudges each parameter to reduce the loss.
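
The smallest demonstration of that machinery is a scalar function of a single tensor, nothing to do with the model yet:

w = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
loss = (w ** 2).sum()         # autograd records the graph for this expression

loss.backward()               # chain rule: d(loss)/dw = 2 * w
w.grad                        # tensor([2., 4., 6.])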

That’s literally the whole training game:

optimizer.zero_grad()       # clear stale gradients from the last step
logits = model(inputs)      # forward pass — autograd records the graph
loss = loss_fn(logits, targets)
loss.backward()             # backward pass — chain rule fills .grad fields
optimizer.step()            # update parameters using .grad

The animated backprop walkthrough shows the chain rule running on a tiny graph if you want to see it move. The animated gradient descent demo shows the optimizer descending a loss surface. Neither is strictly required — just trust that .backward() plus .step() does the right thing.

What we deliberately won’t be using

A few things you’d see in a deep-learning textbook that we genuinely don’t need to write a working LLM:

  • Hand-derived gradients. Autograd handles every gradient. You will not write a single derivative by hand in this curriculum.
  • Detailed backprop algebra. Same reason — we’ll say “the gradient flows backward through this op” without computing its exact form.
  • Convex optimization theory. Our loss landscape is non-convex; first-order optimizers like SGD-with-momentum or AdamW minimize it well enough in practice. We use AdamW and stop worrying.
  • Variational reparameterization tricks. Decoder-only transformers don’t need them.

If a paper you’re reading uses any of those, fine — but they aren’t on the critical path.

Where to go deeper

If anything in the next 15 steps feels mathematically shaky, the matching theory articles and demos linked throughout this page (the Linear Algebra Lab, the Sampling Knobs demo, the Entropy & KL Divergence demo, the backprop and gradient descent walkthroughs) are the first places to look.

Next

Step 02 starts the actual implementation. We’ll build a BPE tokenizer from scratch — about 80 lines of Python that turn raw text into the integer streams every neural language model actually consumes. No external tokenizer libraries; we’ll see exactly how the tokenizers behind GPT-2, GPT-4, and LLaMA work under the hood.