tiny-llm 04 / 16 · 16 min read · 20 min hands-on

step 04 · build

Embeddings and positional encoding

Two lookup tables — one for token, one for position — added together to start the residual stream.

model

Token IDs are integers. Neural networks operate on continuous vectors. The bridge between the two is the embedding layer — a lookup table that maps each token ID to a learned d_model-dimensional vector.
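
If you haven't used nn.Embedding before, a REPL-sized example makes the "lookup table" framing concrete (toy sizes here, not our real config). Drop it in a scratch file or REPL:

# scratch_embedding_lookup.py
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4)                   # 10-token vocab, 4-dim vectors
ids = torch.tensor([3, 7, 3])               # token IDs, note the repeat

out = emb(ids)                              # (3, 4): rows 3, 7, 3 of the table
print(out.shape)                            # torch.Size([3, 4])
print(torch.equal(out[0], out[2]))          # True: same ID, same vector
print(torch.equal(out[0], emb.weight[3]))   # True: it really is just indexing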

We need a second piece on top of that: positional encoding. Self-attention has no built-in notion of order: shuffle the tokens and the q @ k.T scores simply shuffle along with them, and nothing in the math records which token came first. Without a position signal, “the cat ate the fish” and “the fish ate the cat” would look identical to attention. We fix this by adding a position-dependent vector to every token embedding before the model sees it.
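
To see this for yourself, here's a stand-in experiment with random vectors in place of real query/key projections (purely illustrative, nothing from the model yet):

# scratch_order_blindness.py
import torch

torch.manual_seed(0)
x = torch.randn(5, 8)                   # 5 "tokens", 8-dim, no position signal
scores = x @ x.T                        # stand-in for q @ k.T

perm = torch.randperm(5)                # shuffle the sequence
shuffled_scores = x[perm] @ x[perm].T

# The shuffled scores are exactly the original scores, reindexed:
print(torch.allclose(shuffled_scores, scores[perm][:, perm]))   # True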

This step builds both pieces in one Embed module, ~25 lines of model code. After it, our (B, T) integer token tensor becomes a (B, T, d_model) float tensor — exactly what step 05’s attention block expects.

What goes in, what comes out

The shape contract:

input:  (batch, seq_len)             — integer token IDs
output: (batch, seq_len, d_model)    — learned float vectors per token

Internally there are two tables of learned vectors:

  • Token embeddings: nn.Embedding(vocab_size, d_model) — one row per token in the vocabulary
  • Position embeddings: nn.Embedding(max_seq_len, d_model) — one row per position 0, 1, 2, …

For each position t in the sequence, we add the token’s embedding to the position’s embedding. The sum starts the residual stream — the running representation that flows through every transformer block, getting refined.

Why we add (instead of concat)

This is a simple design choice, but it surprises people. Concatenating token + position embeddings would give a (B, T, 2·d_model) tensor, preserving both signals but doubling the dimension everywhere downstream.

Adding them gives (B, T, d_model) — same width as the residual stream. We trust the network to learn to disentangle the two signals if it needs to. In practice it does, and you save half the parameters in every layer that follows.
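
In shapes, the two options look like this (stand-in random tensors, only the dimensions matter):

# scratch_add_vs_concat.py
import torch

B, T, d_model = 2, 12, 128
tok = torch.randn(B, T, d_model)        # stand-in token embeddings
pos = torch.randn(T, d_model)           # stand-in position embeddings

added = tok + pos                       # broadcasts to (2, 12, 128)
concatenated = torch.cat([tok, pos.expand(B, T, d_model)], dim=-1)   # (2, 12, 256)

print(tuple(added.shape), tuple(concatenated.shape))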

This is the choice every transformer makes. The original “Attention Is All You Need” paper uses fixed sinusoidal positions added to learned token embeddings; GPT-2 uses learned positions added to learned tokens; LLaMA-3 uses RoPE (which mixes position into the query/key projections instead of the residual stream). All variations on “the position signal goes alongside the token signal,” not concatenated.

If the trade-offs feel abstract, the Positional Encoding Lab compares all four schemes — sinusoidal, learned, RoPE, and ALiBi — on the same sequence so you can see how each conveys position differently.
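
For the curious, here's roughly what the fixed sinusoidal table from the original paper looks like in code. We don't use it in our Embed module; it's only here to make one of the alternatives concrete:

# scratch_sinusoidal.py  (NOT used in tiny_llm/embed.py)
import torch

def sinusoidal_table(max_seq_len: int, d_model: int) -> torch.Tensor:
    """Fixed (not learned) position table: sin on even dims, cos on odd dims."""
    pos = torch.arange(max_seq_len, dtype=torch.float32).unsqueeze(1)   # (T, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dims
    freqs = torch.exp(-i / d_model * torch.log(torch.tensor(10000.0)))  # 10000^(-i/d)
    table = torch.zeros(max_seq_len, d_model)
    table[:, 0::2] = torch.sin(pos * freqs)
    table[:, 1::2] = torch.cos(pos * freqs)
    return table

print(sinusoidal_table(64, 128).shape)   # torch.Size([64, 128])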

Setup

Create a new file:

# tiny_llm/embed.py
import torch
import torch.nn as nn

That’s the import block. We’re staying in nn.Module land — this is just two nn.Embedding lookups and an add.

The class

# tiny_llm/embed.py
class Embed(nn.Module):
    """Token + learned positional embedding.

    Input:  (B, T) integer token IDs
    Output: (B, T, d_model) float embeddings, ready for the residual stream.
    """

    def __init__(self, vocab_size: int, d_model: int, max_seq_len: int) -> None:
        super().__init__()
        self.d_model = d_model
        self.max_seq_len = max_seq_len

        # One row per token in the vocab. Initialized small (~0.02 std) —
        # this is the GPT-2 / LLaMA convention. Default nn.Embedding init
        # is N(0, 1) which is way too large for transformer training.
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        nn.init.normal_(self.tok_emb.weight, mean=0.0, std=0.02)

        # One row per absolute position 0..max_seq_len-1.
        self.pos_emb = nn.Embedding(max_seq_len, d_model)
        nn.init.normal_(self.pos_emb.weight, mean=0.0, std=0.02)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        B, T = token_ids.shape
        if T > self.max_seq_len:
            raise ValueError(
                f"Sequence length {T} exceeds max_seq_len {self.max_seq_len}"
            )

        # 1. Token embedding: (B, T) → (B, T, d_model)
        tok = self.tok_emb(token_ids)

        # 2. Position indices: same for every batch row, just 0..T-1.
        positions = torch.arange(T, device=token_ids.device)
        pos = self.pos_emb(positions)            # (T, d_model)

        # 3. Add. Broadcasting handles the missing batch dim:
        # (B, T, d_model) + (T, d_model) → (B, T, d_model)
        return tok + pos

Three things worth understanding in detail:

The init scale. nn.init.normal_(..., std=0.02) is the GPT-2 default. PyTorch's default embedding init is N(0, 1), far too wide; for a transformer about to be trained at depth ≥ 6 with residual connections, that scale blows up the activations and training diverges almost immediately. std=0.02 is the conventional fix used by GPT-2 and most decoder-only LLMs since, for both embeddings and linear-layer weights. We'll set it again on the linear layers in step 07.
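
You can check the scale difference directly (the sizes below are just our toy config):

# scratch_init_scale.py
import torch
import torch.nn as nn

torch.manual_seed(0)
default = nn.Embedding(4096, 128)                    # PyTorch default: N(0, 1)
scaled = nn.Embedding(4096, 128)
nn.init.normal_(scaled.weight, mean=0.0, std=0.02)

print(f"default std: {default.weight.std():.4f}")    # ~1.0
print(f"scaled std:  {scaled.weight.std():.4f}")     # ~0.02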

Position indices come from torch.arange(T). There’s no batch dimension on positions because positions don’t depend on the batch — every batch row uses the same 0, 1, 2, …, T-1 sequence. PyTorch broadcasts the resulting (T, d_model) tensor across the batch dimension when we add.
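
If the broadcast feels like magic, it's the same as adding an explicit batch dimension of size 1 (tiny stand-in tensors, not the real shapes):

# scratch_broadcasting.py
import torch

tok = torch.randn(2, 4, 8)                # (B, T, d_model) stand-in
pos = torch.randn(4, 8)                   # (T, d_model) stand-in

implicit = tok + pos                      # pos reused for every batch row
explicit = tok + pos.unsqueeze(0)         # same thing, batch dim added by hand
print(torch.equal(implicit, explicit))    # True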

max_seq_len cap. Learned positional embeddings have a fixed table size: you can't ask the model about position 1024 if you only trained positions 0..1023. We carry max_seq_len as a hyperparameter and raise a clear error as soon as a longer sequence shows up. Sinusoidal and ALiBi both extrapolate; RoPE extrapolates with caveats. Learned absolute positions, which is what we're using, don't extrapolate at all.
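
A quick way to see the cap in action, assuming the Embed class above lives at tiny_llm/embed.py:

# scratch_max_seq_len.py
import torch
from tiny_llm.embed import Embed

embed = Embed(vocab_size=4096, d_model=128, max_seq_len=64)
too_long = torch.randint(0, 4096, (1, 100))   # 100 > 64

try:
    embed(too_long)
except ValueError as e:
    print(e)   # Sequence length 100 exceeds max_seq_len 64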

Sanity check

Add at the bottom of the file:

# tiny_llm/embed.py
if __name__ == "__main__":
    torch.manual_seed(0)

    embed = Embed(vocab_size=4096, d_model=128, max_seq_len=64)
    print(f"params: {sum(p.numel() for p in embed.parameters()):,}")

    # Two sequences, length 12, with random token IDs.
    ids = torch.randint(0, 4096, (2, 12))
    out = embed(ids)

    print(f"\ninput shape:  {tuple(ids.shape)}")
    print(f"output shape: {tuple(out.shape)}")
    print(f"output dtype: {out.dtype}")
    print(f"output stats: mean={out.mean():.4f}, std={out.std():.4f}")

    # Verify position dependence: same token at two different positions
    # should produce different vectors.
    same_id = torch.tensor([[42, 42, 42, 42]])
    same_out = embed(same_id)
    print(f"\nsame token (id=42) at positions 0..3:")
    print(f"  position 0 vs 1 differ: {not torch.allclose(same_out[0, 0], same_out[0, 1])}")
    print(f"  position 0 vs 2 differ: {not torch.allclose(same_out[0, 0], same_out[0, 2])}")

    # Verify token dependence: different tokens at the same position
    # should also differ.
    diff_id = torch.tensor([[10, 20]])
    diff_out = embed(diff_id)
    print(f"\ndifferent tokens at position 0 (id=10) and 1 (id=20):")
    print(f"  vectors differ: {not torch.allclose(diff_out[0, 0], diff_out[0, 1])}")

Run it:

uv run python -m tiny_llm.embed

Expected output (the stats may differ slightly across PyTorch versions and hardware; the shapes and the True/False checks should match):

params: 532,480

input shape:  (2, 12)
output shape: (2, 12, 128)
output dtype: torch.float32
output stats: mean=0.0001, std=0.0283

same token (id=42) at positions 0..3:
  position 0 vs 1 differ: True
  position 0 vs 2 differ: True

different tokens at position 0 (id=10) and 1 (id=20):
  vectors differ: True

What to notice:

  • std ≈ 0.028. Roughly √2 · 0.02, because we're adding two independent N(0, 0.02²) tensors. This small starting scale is what the rest of the init (step 07's linear layers and layer norms) assumes; if it were much larger, training would tend to diverge within the first few steps.
  • Same token, different positions, different output vectors. That’s positional encoding doing its job — the model has a way to tell “this word at position 3” from “the same word at position 5.”
  • Different tokens, different output vectors. Obvious, but worth confirming — if our token embedding lookup were broken, every word would look identical.
  • 532k parameters. Vocab × d_model + max_seq_len × d_model = 4096·128 + 64·128 = 524,288 + 8,192. The token embedding dominates at this scale, and it still dominates in real models with 50k-token vocabularies; the quick arithmetic below makes that concrete.
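
For scale, the same bookkeeping with GPT-2-ish numbers (50,257-token vocab, 1,024 positions, d_model = 768):

# scratch_param_count.py
vocab_size, max_seq_len, d_model = 50_257, 1_024, 768

tok_params = vocab_size * d_model     # 38,597,376
pos_params = max_seq_len * d_model    #    786,432
print(f"{tok_params:,} vs {pos_params:,}  (~{tok_params // pos_params}x)")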

Wire-up: from data to attention

Here's where we are. The pipeline now spans four files:

tiny_llm/
├── tokenize.py     # step 02 — text → token IDs
├── data.py         # step 03 — TinyStories → batches of token IDs
├── embed.py        # step 04 — token IDs → (B, T, d_model) tensor   ← we are here
└── attention.py    # step 05 — (B, T, d_model) → (B, T, d_model)

A quick smoke test that ties the data pipeline to the model so far. Drop this in a scratch file or REPL:

# scratch_smoke_test.py
import torch
from tiny_llm.data import load_token_array, get_batch, DATA_DIR
from tiny_llm.embed import Embed
from tiny_llm.attention import CausalSelfAttention

# Load data
data = load_token_array(DATA_DIR / "valid.bin")
x, y = get_batch(data, batch_size=2, seq_len=12)
print(f"x: {tuple(x.shape)}")

# Embed
embed = Embed(vocab_size=4096, d_model=64, max_seq_len=64)
e = embed(x)
print(f"after embed: {tuple(e.shape)}")

# Attention
attn = CausalSelfAttention(d_model=64, max_seq_len=64)
a = attn(e)
print(f"after attention: {tuple(a.shape)}")

Run it:

uv run python scratch_smoke_test.py

Expected output:

x: (2, 12)
after embed: (2, 12, 64)
after attention: (2, 12, 64)

You now have the start of an actual transformer. Token IDs go in, embeddings come out, attention transforms them in shape-preserving ways. The next four steps build on top of this exact tensor flow.

What we did and didn’t do

What we did:

  • Token + learned absolute positional embedding in one nn.Module
  • Proper init (std=0.02) so the residual stream stats start at the right scale
  • A position-dependence sanity check
  • Confirmed end-to-end that data → embed → attention runs cleanly

What we didn’t:

  • RoPE or ALiBi. Modern models use rotary position embeddings (LLaMA, GPT-NeoX, Qwen) or attention with linear biases (ALiBi, used by BLOOM). Our learned absolute positions are simpler and good enough at our scale. The Positional Encoding Lab compares all four if you're curious.
  • Tied input/output embeddings. Many transformers tie the token embedding's weight matrix to the output projection (lm_head): the same parameters embed the input and project the hidden state back to logits (see the sketch after this list). We'll do this in step 08 when we assemble the full GPT.
  • Embedding dropout. Some models apply dropout to the embeddings during training as a regularizer. We skip it: on TinyStories, weight decay and early stopping are regularization enough, without adding complexity here.
  • Token-type embeddings. BERT and friends have a third embedding for “is this segment A or segment B”; decoder-only models don’t need it.
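
To make the tying idea concrete before step 08: the gist is one shared weight matrix used in both directions. A minimal sketch, not the step-08 wiring:

# scratch_weight_tying.py  (sketch only)
import torch.nn as nn

d_model, vocab_size = 128, 4096
tok_emb = nn.Embedding(vocab_size, d_model)            # weight: (vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # weight: (vocab_size, d_model)

lm_head.weight = tok_emb.weight                # one parameter object, two roles
print(lm_head.weight is tok_emb.weight)        # True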

Next

Step 05 (which you may have already read) takes the (B, T, d_model) tensor we just produced and applies single-head causal self-attention. With the data pipeline + embeddings done, that’s the first half of the model — every later step composes more pieces on top of these.

If you read step 05 already as part of the format spike: re-running the sanity check there with d_model=64 and max_seq_len=64 should match what we just produced. If you haven’t read it yet, head there next.