step 03 · build

Tokenize TinyStories, build batches

Download the dataset, train the BPE on it, save token IDs once, and yield (input, target) batches the model can learn from.

modeldata

We have a tokenizer. Now we need text to feed it. This step picks our training corpus, tokenizes it once, saves the result, and writes the batching logic the training loop will pull from.

By the end you’ll have:

TinyStories downloaded to a local file (~50 MB compressed)
The BPE tokenizer from step 02 trained on it
The whole dataset encoded as a single numpy array of token IDs, saved to disk
A get_batch(split, batch_size, seq_len) function that yields (input, target) tensors ready for model(input) and loss(logits, target)

That’s the entire data pipeline. Once it’s in place, every later step just calls get_batch and ignores the rest.

Why TinyStories

TinyStories is a synthetic dataset of short children’s stories generated by GPT-3.5 and GPT-4. Two reasons it’s the right choice for this curriculum:

The vocabulary is small — words a four-year-old understands. A model with a few million parameters can learn it well enough to produce coherent output. Larger corpora would require larger models that we can’t train on a laptop.
The structure is simple — every story has characters, a setting, an action, sometimes a moral. A model that learns this is demonstrably “doing language modeling” without needing to memorize Wikipedia or write Python.

For one-tenth of the work, you get a model that produces output you can read and judge. We use it for the rest of the curriculum.

If you want to substitute your own corpus later (Shakespeare, your own writing, code), the data pipeline below is structured so swapping the source is a 5-line change.

Setup

Add numpy and requests to the project:

uv add numpy requests

Create the data module:

# tiny_llm/data.py
from __future__ import annotations
import os
import urllib.request
from pathlib import Path
import numpy as np
import torch

from tiny_llm.tokenize import BPETokenizer

We’re using urllib.request instead of requests to keep the install slim. numpy is for the tokenized array (faster + smaller on disk than a Python list of ints).

Step 1: download

TinyStories has a clean train/validation split published on HuggingFace. We grab the smaller valid split first to develop against, then train on the full thing later.

# tiny_llm/data.py
DATA_DIR = Path(__file__).parent.parent / "data"
DATA_DIR.mkdir(exist_ok=True)

# We use HuggingFace's hosted version of TinyStories. Single-file
# downloads keep this dependency-free.
TRAIN_URL = "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-train.txt"
VALID_URL = "https://huggingface.co/datasets/roneneldan/TinyStories/resolve/main/TinyStoriesV2-GPT4-valid.txt"


def download_file(url: str, dest: Path) -> None:
    """Download `url` to `dest` if it doesn't already exist."""
    if dest.exists():
        print(f"  ✓ {dest.name} already exists ({dest.stat().st_size / 1e6:.1f} MB)")
        return
    print(f"  ↓ downloading {url} → {dest.name}")
    urllib.request.urlretrieve(url, dest)
    print(f"    {dest.stat().st_size / 1e6:.1f} MB")


def download_tinystories() -> tuple[Path, Path]:
    """Download train + valid splits, return paths to both."""
    train_path = DATA_DIR / "tinystories_train.txt"
    valid_path = DATA_DIR / "tinystories_valid.txt"
    download_file(VALID_URL, valid_path)   # smaller, get it first
    download_file(TRAIN_URL, train_path)
    return train_path, valid_path

The valid split is ~20 MB and downloads in seconds. The train split is ~2 GB; if you’re on a flaky connection or just want to validate the pipeline, comment out the train download for now.

Step 2: train the tokenizer

The tokenizer needs to see real text to learn good merges. We train on a sample of the train split — full corpus would take a while, and 10 MB of stories is plenty to learn the common merges.

# tiny_llm/data.py
def train_tokenizer(train_path: Path, vocab_size: int = 4096, sample_mb: int = 10) -> BPETokenizer:
    """Train a BPE tokenizer on a sample of the corpus.

    Args:
        train_path: path to the raw text file.
        vocab_size: target vocabulary size (we'll use 4096 for the tiny model).
        sample_mb: how many MB of text to use for tokenizer training.
    """
    print(f"  ↻ training BPE on {sample_mb} MB of {train_path.name}, target vocab = {vocab_size}")
    with open(train_path, "r", encoding="utf-8") as f:
        sample = f.read(sample_mb * 1024 * 1024)
    tok = BPETokenizer()
    tok.train(sample, vocab_size=vocab_size)
    print(f"    learned {len(tok.merges)} merges, vocab size {tok.vocab_size}")
    return tok

A vocab of ~4k is small enough that our embedding matrix stays tiny (and we can iterate on toy models fast) but large enough that common English words become single tokens. GPT-2 used 50k; LLaMA-3 uses 128k. Our 4k is suited to the 4-year-old vocabulary in TinyStories.

Step 3: encode and save

Once the tokenizer is trained, we encode the entire corpus and save it as a binary numpy array. After this step we never look at the raw text again — training reads the array directly.

# tiny_llm/data.py
def encode_and_save(tok: BPETokenizer, src_path: Path, dst_path: Path) -> None:
    """Encode the full text file at src_path, save IDs to dst_path as uint16.

    Why uint16: our vocab fits in 16 bits (4096 < 65536). Halves disk space
    vs the default int64. uint16 also lets numpy mmap directly later.
    """
    if dst_path.exists():
        print(f"  ✓ {dst_path.name} already exists ({dst_path.stat().st_size / 1e6:.1f} MB)")
        return
    print(f"  ↻ encoding {src_path.name} → {dst_path.name}")
    with open(src_path, "r", encoding="utf-8") as f:
        text = f.read()
    ids = tok.encode(text)
    arr = np.array(ids, dtype=np.uint16)
    arr.tofile(dst_path)
    print(f"    {len(ids):,} tokens, {dst_path.stat().st_size / 1e6:.1f} MB")

A note on dtype=np.uint16: it’s a 2× space win and lets us memory-map the file at training time, which means even multi-GB corpora can be handled by tiny-RAM machines. With our 4k vocab, IDs fit comfortably.

Step 4: get_batch — the only function the trainer calls

This is the function step 09’s training loop will call thousands of times. It picks batch_size random starting positions in the token array and slices out windows of seq_len + 1 tokens, then splits each window into (input, target) where target = input shifted by 1.

# tiny_llm/data.py
def load_token_array(path: Path) -> np.ndarray:
    """Memory-map the saved token IDs as a flat uint16 array."""
    return np.memmap(path, dtype=np.uint16, mode="r")


def get_batch(
    data: np.ndarray,
    batch_size: int,
    seq_len: int,
    device: str = "cpu",
) -> tuple[torch.Tensor, torch.Tensor]:
    """Sample `batch_size` random windows of length `seq_len + 1`.

    Returns:
        x: (batch_size, seq_len) — input tokens
        y: (batch_size, seq_len) — target tokens, x shifted by 1

    The model's job for each token in x is to predict the corresponding
    token in y. Cross-entropy on (logits, y) is our training signal.
    """
    # Pick random starting positions. We need seq_len + 1 tokens per
    # window so the last input token has a target.
    starts = np.random.randint(0, len(data) - seq_len - 1, size=batch_size)

    # Gather the windows. List comprehension + np.stack is plenty fast
    # at the scales we'll work with.
    x = np.stack([data[s : s + seq_len].astype(np.int64) for s in starts])
    y = np.stack([data[s + 1 : s + seq_len + 1].astype(np.int64) for s in starts])

    # numpy → torch. We cast to int64 because nn.Embedding requires it.
    x_t = torch.from_numpy(x).to(device)
    y_t = torch.from_numpy(y).to(device)
    return x_t, y_t

Two design choices that aren’t obvious:

Random starts, not sequential. Each batch is independent random samples. This gives us much better gradient diversity than walking the corpus front-to-back, and lets each “epoch” mean nothing in particular — we just train for a fixed number of steps. Modern LLM training all works this way.
Memory-mapping. np.memmap doesn’t load the whole file into RAM. The OS pages in chunks as we read random windows. Means we can train against the full 2 GB corpus from a 1 GB laptop.

The (x, y) shift is the entire teaching signal: for every position t, the model sees tokens 0..t and must predict token t+1. That’s the autoregressive language modeling objective — same one used by every GPT-style model.

Putting it together

A prepare() entry point that runs all four phases:

# tiny_llm/data.py
def prepare(vocab_size: int = 4096, sample_mb: int = 10) -> BPETokenizer:
    """Run the full data pipeline. Idempotent — re-run to verify."""
    train_path, valid_path = download_tinystories()
    tok = train_tokenizer(train_path, vocab_size=vocab_size, sample_mb=sample_mb)
    encode_and_save(tok, train_path, DATA_DIR / "train.bin")
    encode_and_save(tok, valid_path, DATA_DIR / "valid.bin")
    return tok

Now the sanity check. Add at the bottom of the file:

# tiny_llm/data.py
if __name__ == "__main__":
    tok = prepare()

    print("\n--- batch sample ---")
    valid = load_token_array(DATA_DIR / "valid.bin")
    x, y = get_batch(valid, batch_size=2, seq_len=12)
    print(f"x shape: {tuple(x.shape)},  y shape: {tuple(y.shape)}")

    # Decode the first batch entry to verify the (x, y) shift.
    print(f"\nx[0] tokens: {x[0].tolist()}")
    print(f"y[0] tokens: {y[0].tolist()}")
    print(f"x[0] decoded: {tok.decode(x[0].tolist())!r}")
    print(f"y[0] decoded: {tok.decode(y[0].tolist())!r}")
    print(f"\n→ y is x shifted by 1: {x[0, 1:].tolist() == y[0, :-1].tolist()}")

Run it:

uv run python -m tiny_llm.data

The first run will download both splits (large) and train the tokenizer; subsequent runs are instant because of the existence checks.

Expected output (truncated, exact tokens depend on what was sampled):

  ✓ tinystories_valid.txt already exists (19.3 MB)
  ↓ downloading https://huggingface.co/.../TinyStoriesV2-GPT4-train.txt → tinystories_train.txt
    1832.4 MB
  ↻ training BPE on 10 MB of tinystories_train.txt, target vocab = 4096
    learned 4060 merges, vocab size 4096
  ↻ encoding tinystories_train.txt → train.bin
    472,113,920 tokens, 944.2 MB
  ↻ encoding tinystories_valid.txt → valid.bin
    4,872,406 tokens, 9.7 MB

--- batch sample ---
x shape: (2, 12),  y shape: (2, 12)

x[0] tokens: [3411, 12, 21, 1248, 8, 213, 1402, 12, 1011, 8, 213, 442]
y[0] tokens: [12, 21, 1248, 8, 213, 1402, 12, 1011, 8, 213, 442, 8]
x[0] decoded: 'lily was the kind of girl who liked the'
y[0] decoded: 'was the kind of girl who liked the way'

→ y is x shifted by 1: True

What to notice:

y[0] is x[0] shifted by one position. The teaching signal lives in this off-by-one — at every position, the model sees the prefix and must predict the next token.
The decoded text reads like a TinyStories sentence. That’s the dataset working as advertised — small vocabulary, simple sentences, no exotic structure.
One token per ~4 characters. Our 4k vocab compresses English about as well as GPT-2’s 50k vocab does — at this scale, the marginal returns from a bigger vocab are small.

What we did and didn’t do

What we did:

End-to-end data pipeline: download → train tokenizer → encode → save → batch
Memory-mapped tokenized array so the trainer doesn’t need to hold the full corpus in RAM
(x, y) pairs with the autoregressive shift, ready for nn.CrossEntropyLoss
A single prepare() entry point that’s idempotent

What we didn’t:

Train/valid leakage check. We trust HuggingFace’s split. For a real production setup you’d verify the sets are disjoint.
Streaming tokenization. We hold the encoded train.bin (~1 GB) in memory-map, which is fine. For multi-TB corpora you’d want a streaming tokenizer.
Distributed/sharded data loading. Single-machine assumption. Real LLM training shards across hundreds of GPUs; we have one.

Step 04 implements the embedding layer — the first piece of actual model code. Token IDs become vectors via a learned lookup, position information gets added on top, and we have a tensor of shape (B, T, d_model) ready for the attention block we already built.