tiny-llm 02 / 16 · 25 min read · 35 min hands-on

step 02 · build

Build a BPE tokenizer

Train your own byte-pair encoding from scratch — no external library, ~90 lines of Python.

model · tokenizer

Before any neural network sees your text, a tokenizer has to chop it into integer IDs. The choice of tokenizer is one of the most consequential design decisions in modern LLMs — it determines vocabulary size, sequence length, and even what the model can express.

We’re going to write one from scratch. Specifically: byte-pair encoding (BPE), the algorithm behind GPT-2, GPT-4, LLaMA, and most production tokenizers. By the end you’ll have a BPETokenizer class that trains on any text corpus and round-trips strings to integer IDs and back — about 90 lines of Python with zero external dependencies beyond the standard library.

What BPE actually does

BPE starts from a very simple premise: the most frequent adjacent pairs of characters should be merged into single tokens. Repeat that until you hit your target vocab size.

Concretely, the training algorithm:

  1. Initialize the vocabulary as the set of unique characters in the corpus.
  2. Split the corpus into “words” (we’ll use whitespace splits).
  3. Represent each word as a list of single characters.
  4. Count the frequency of every adjacent pair across all words.
  5. Find the most frequent pair (a, b).
  6. Add a new token ab to the vocabulary. Record this as a merge rule.
  7. Replace every (a, b) pair in every word with the new token ab.
  8. Repeat 4–7 until vocab is the desired size.

The output: a vocabulary (token strings → integer IDs) and an ordered list of merge rules. To encode new text later, we apply the merge rules in the order they were learned. To decode integer IDs, we look up each token string and concatenate.

That’s it. The whole algorithm in eight lines of pseudocode. The interesting bit is implementing it efficiently enough to be usable, which is what we’ll do.
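
To make the loop concrete, here's a hand trace on a tiny made-up corpus (four word occurrences, chosen purely for illustration):

# corpus: "low low lower lowest"
# words:        (l,o,w) ×2    (l,o,w,e,r) ×1    (l,o,w,e,s,t) ×1
# pair counts:  (l,o)=4  (o,w)=4  (w,e)=2  (e,r)=1  (e,s)=1  (s,t)=1
# merge 1: (l,o)  -> "lo"     words become (lo,w) ×2   (lo,w,e,r)   (lo,w,e,s,t)
# merge 2: (lo,w) -> "low"    words become (low,) ×2   (low,e,r)    (low,e,s,t)
# merge 3: (low,e) -> "lowe"  ... and so on, until the vocab reaches its target size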

If you’ve used the Tokenizer Surgery demo, you’ve seen the output of this exact algorithm running at scale on real corpora — GPT-2’s tokenizer is a BPE trained on web text with ~50k merges. Ours will be tiny by comparison, but the same shape.

Setup

Create a new file:

# tiny_llm/tokenize.py
from __future__ import annotations
from collections import Counter
from typing import Iterable

That’s our entire import block — Counter for fast frequency counts, Iterable for type hints. No external deps.

The class skeleton

# tiny_llm/tokenize.py
class BPETokenizer:
    """
    A minimal byte-pair-encoding tokenizer.

    Train it on a corpus to learn a vocabulary; encode/decode arbitrary
    text to/from lists of integer IDs.
    """

    # Special tokens that should never be merged or split.
    SPECIAL_TOKENS = ["<|pad|>", "<|bos|>", "<|eos|>", "<|unk|>"]

    def __init__(self) -> None:
        # token string -> integer id
        self.vocab: dict[str, int] = {}
        # the inverse, populated after training
        self.id_to_token: dict[int, str] = {}
        # ordered list of (a, b) -> "ab" merge rules
        self.merges: list[tuple[str, str]] = []

    @property
    def vocab_size(self) -> int:
        return len(self.vocab)

Three pieces of state: vocab maps token strings to integer IDs, id_to_token is the inverse for decoding, merges records the order we learned merge rules in (this order matters at encode time).

The SPECIAL_TOKENS list is a conventional set most modern tokenizers carry in some form: a padding token, a beginning-of-sequence marker, an end-of-sequence marker, and an unknown-token fallback. We’ll wire them into the vocab first so they always have low, stable IDs.
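
Before writing the methods, here's the API we're building toward (a sketch only; train, encode, and decode are implemented in the sections below):

# Intended usage once the class is complete:
tok = BPETokenizer()
tok.train(corpus, vocab_size=512)   # corpus: any training text; 512 is an arbitrary target
ids = tok.encode("some new text")   # -> list[int]
text = tok.decode(ids)              # -> "some new text"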

Training

The train method is the heart. It implements the algorithm we sketched above:

# tiny_llm/tokenize.py (continuing the class)
    def train(self, corpus: str, vocab_size: int) -> None:
        """Learn `vocab_size` tokens from the given corpus.

        Strategy: start from the unique characters in the corpus, then
        repeatedly merge the most frequent adjacent pair until we hit
        `vocab_size` total tokens (including specials and chars).
        """
        # 1. Pre-tokenize the corpus into "words". Whitespace splits.
        # Each word becomes a list of single-character tokens.
        words: dict[tuple[str, ...], int] = Counter()
        for line in corpus.splitlines():
            for raw in line.split():
                # End each word with a special boundary marker so encode
                # later can tell where words start/end. GPT-2 prepends
                # the space; we'll append a sentinel for clarity.
                tup = tuple(raw) + ("</w>",)
                words[tup] += 1

        # 2. Seed the vocab: specials first, then the </w> sentinel and
        #    every unique character we saw (sorted for determinism).
        chars: set[str] = {"</w>"}
        for word in words:
            for ch in word:
                chars.add(ch)

        self.vocab = {tok: i for i, tok in enumerate(self.SPECIAL_TOKENS)}
        for ch in sorted(chars):
            if ch not in self.vocab:
                self.vocab[ch] = len(self.vocab)
        self.merges = []

        # 3. Repeatedly merge the most frequent pair.
        while len(self.vocab) < vocab_size:
            pair_counts = self._count_pairs(words)
            if not pair_counts:
                break  # nothing left to merge

            best_pair, best_count = pair_counts.most_common(1)[0]
            if best_count < 2:
                break  # no pair appears more than once — stop

            new_token = best_pair[0] + best_pair[1]
            self.vocab[new_token] = len(self.vocab)
            self.merges.append(best_pair)
            words = self._apply_merge(words, best_pair, new_token)

        # 4. Build the inverse map for decoding.
        self.id_to_token = {i: t for t, i in self.vocab.items()}

    @staticmethod
    def _count_pairs(words: dict[tuple[str, ...], int]) -> Counter:
        """Count every adjacent pair across all words, weighted by frequency."""
        pairs: Counter = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        return pairs

    @staticmethod
    def _apply_merge(
        words: dict[tuple[str, ...], int],
        pair: tuple[str, str],
        new_token: str,
    ) -> dict[tuple[str, ...], int]:
        """Replace every occurrence of `pair` with `new_token` across all words."""
        merged: dict[tuple[str, ...], int] = Counter()
        a, b = pair
        for word, freq in words.items():
            new_word: list[str] = []
            i = 0
            while i < len(word):
                if i < len(word) - 1 and word[i] == a and word[i + 1] == b:
                    new_word.append(new_token)
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged[tuple(new_word)] += freq
        return merged

A few things worth calling out:

The </w> sentinel. Every word ends with </w> so the tokenizer can distinguish “ed” mid-word from “ed” at a word boundary. GPT-2 uses a different convention (a leading space character Ġ), but the underlying need — encoding word boundaries — is the same.
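
For instance, with two hypothetical words (chosen only to show the boundary marker at work):

# "played"    -> ('p', 'l', 'a', 'y', 'e', 'd', '</w>')
#                word-final "ed" sits next to '</w>', so "ed</w>" can become a token
# "education" -> ('e', 'd', 'u', 'c', 'a', 't', 'i', 'o', 'n', '</w>')
#                this "ed" never touches '</w>' and stays a plain mid-word "ed"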

Stopping early. Two break conditions: no pairs left, or the most frequent pair appears only once. Once every pair is unique, merging adds vocab without compressing anything.

Counter-of-tuples. We represent each word as a tuple of token strings. After a merge, the tuple gets shorter. Repeated identical words are deduplicated via the Counter’s freq weight — much faster than processing each occurrence separately.
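
Concretely, here's what a hypothetical words Counter looks like before and after one merge:

# after pre-tokenizing "the cat the cat the":
#   {('t', 'h', 'e', '</w>'): 3, ('c', 'a', 't', '</w>'): 2}
#
# after applying the merge ('t', 'h') -> "th":
#   {('th', 'e', '</w>'): 3, ('c', 'a', 't', '</w>'): 2}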

This is the same technique the original BPE paper (Sennrich et al., 2015) used. Production tokenizers add tricks like Unicode-byte fallback and regex-based pre-tokenization, but the algorithm at the heart is identical.

Encoding

Now: given trained merges and vocab, turn arbitrary text into integer IDs.

# tiny_llm/tokenize.py (continuing the class)
    def encode(self, text: str) -> list[int]:
        """Encode text as a list of integer token IDs."""
        ids: list[int] = []
        for line in text.splitlines():
            for raw in line.split():
                tokens = self._tokenize_word(raw)
                for tok in tokens:
                    ids.append(self.vocab.get(tok, self.vocab["<|unk|>"]))
        return ids

    def _tokenize_word(self, word: str) -> list[str]:
        """Apply learned merges to a single word, in order."""
        # Start as a list of single chars + the </w> sentinel
        tokens = list(word) + ["</w>"]

        # Apply each merge rule in the order it was learned, replacing
        # every (a, b) occurrence in one left-to-right pass per rule.
        for a, b in self.merges:
            i = 0
            new_tokens: list[str] = []
            while i < len(tokens):
                if i < len(tokens) - 1 and tokens[i] == a and tokens[i + 1] == b:
                    new_tokens.append(a + b)
                    i += 2
                else:
                    new_tokens.append(tokens[i])
                    i += 1
            tokens = new_tokens
        return tokens

The key insight: merges must be applied in training order. The first merge we learned is (t, h) → th; if we applied a later merge like (h, e) → he first, we’d never form “the”, even if (th, e) was a learned merge. Order is everything.
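
As a concrete trace, assuming the tokenizer learned ('t', 'h'), ('th', 'e'), and ('the', '</w>') as its first three merges, _tokenize_word("the") proceeds like this:

# start:                   ['t', 'h', 'e', '</w>']
# after ('t', 'h'):        ['th', 'e', '</w>']
# after ('th', 'e'):       ['the', '</w>']
# after ('the', '</w>'):   ['the</w>']
# every later merge rule still scans the list, but finds nothing left to merge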

This is the slow-but-correct version. Production tokenizers use a priority queue + linked list to encode in O(n log n) per word; ours is O(n × len(merges)), which is fine for our scale.

Decoding

The inverse operation is straightforward: look up each ID, concatenate, replace </w> with a space:

# tiny_llm/tokenize.py (continuing the class)
    def decode(self, ids: Iterable[int]) -> str:
        """Decode integer token IDs back to a string."""
        tokens = [self.id_to_token.get(i, "<|unk|>") for i in ids]
        text = "".join(tokens).replace("</w>", " ")
        return text.strip()
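
One caveat worth knowing: because encode splits on whitespace and decode turns every </w> into a single space, the round trip normalizes whitespace. With a tokenizer trained on a corpus containing these words, something like this happens:

# tabs, newlines, and runs of spaces all come back as single spaces
tok.decode(tok.encode("the  cat\nsat"))   # -> 'the cat sat'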

Sanity check

Add a test block at the bottom of the file:

# tiny_llm/tokenize.py (bottom of file)
if __name__ == "__main__":
    # A toy corpus. Repetition matters — BPE learns the most frequent pairs.
    corpus = """
    the cat sat on the mat
    the dog sat on the log
    the cat and the dog and the mat
    a dog and a cat and a log
    """ * 5

    tok = BPETokenizer()
    tok.train(corpus, vocab_size=40)

    print(f"vocab size: {tok.vocab_size}")
    print(f"merges learned: {len(tok.merges)}")
    print(f"first 8 merges: {tok.merges[:8]}")

    # Round-trip a sentence.
    sample = "the cat sat on a log"
    ids = tok.encode(sample)
    decoded = tok.decode(ids)

    print(f"\noriginal:  {sample!r}")
    print(f"ids:       {ids}")
    print(f"decoded:   {decoded!r}")
    print(f"matches:   {decoded == sample}")

Run it:

uv run python -m tiny_llm.tokenize

Expected output (the exact merges depend on tie-breaking but the shape is stable):

vocab size: 40
merges learned: 23
first 8 merges: [('t', 'h'), ('th', 'e'), ('the', '</w>'), ('a', 't'),
                 ('a', 'n'), ('an', 'd'), ('and', '</w>'), ('o', 'g')]

original:  'the cat sat on a log'
ids:       [21, 28, 12, 7, 13, 8, 13, 12, 6, 22, 13, 5, 23]
decoded:   'the cat sat on a log'
matches:   True

What to notice:

  • ('t', 'h') is learned first. Across the toy corpus, “th” is the most common adjacent pair (every “the” contributes one). After the merge, “the” is one token away from being a real word.
  • ('the', '</w>') learned third. Now the entire word “the” is a single token. Almost all of our space savings come from these word-level merges on common words.
  • The round-trip succeeds. decode(encode(x)) == x (up to whitespace normalization) is the basic correctness property; if it breaks, the tokenizer is unusable.

Why this matters

Three details that make BPE the right default for LLMs:

Open-vocabulary by construction. With byte-level fallback (which production tokenizers use, and which we skip below), BPE never produces an out-of-vocabulary token: worst case, an unfamiliar character falls back to its raw bytes, one token per byte. That’s why GPT-4 can read your novel in Tolkien’s Elvish even though Elvish wasn’t in its training set.

Compression is meaningful. A typical BPE-tokenized English text is roughly 4 characters per token, so a 1024-token context window holds ~4000 characters of English. Compression varies by language and domain: a tokenizer trained mostly on English web text compresses code and non-Latin scripts noticeably worse (fewer characters per token). The Tokenizer Surgery demo lets you compare GPT-2, GPT-4, and GPT-4o on the same input.

The vocab is small. GPT-2 had 50k tokens. LLaMA-3 has 128k. Compare to a million-word dictionary or all of Unicode. The smaller vocab means a smaller embedding matrix and a smaller output projection, both of which scale with vocab_size × d_model.
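
For scale: GPT-2 small has d_model = 768, so its token embedding matrix alone is 50,257 × 768 ≈ 38.6M parameters, roughly a third of the model’s 124M total.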

What we did and didn’t do

What we did:

  • Implemented training, encoding, and decoding for byte-pair encoding
  • Special-token handling
  • Round-trip correctness on a toy corpus
  • ~90 lines of Python, no dependencies beyond collections.Counter

What we didn’t:

  • Byte-level fallback. Real GPT-style tokenizers operate on bytes, not Unicode characters, so they can encode literally anything. We work on characters; if our test corpus has a character not in our vocab, we emit <|unk|>. Step 03 will work around this by training on the same data we test on.
  • Regex pre-tokenization. GPT-2 splits text by a famous regex ('s, 't, contractions, punctuation) before BPE sees it; the pattern is reproduced after this list. We just split on whitespace. Adequate for TinyStories, not for code or prose-with-punctuation.
  • Speed. Our encode is O(text × merges); production tokenizers (HuggingFace Tokenizers, tiktoken) are 100–1000× faster. We don’t care — we’ll tokenize the dataset once and save the result.
  • Subword regularization. Some training tricks (BPE-dropout, unigram LM) randomize tokenization at training time; we don’t.
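
For reference, GPT-2’s pre-tokenization pattern (from the original openai/gpt-2 encoder.py; it needs the third-party regex module because of the \p{...} character classes) looks roughly like this:

import regex as re

# Splits off contraction suffixes, then runs of letters, digits, and other
# symbols (each optionally preceded by one space), and finally runs of whitespace.
GPT2_PRETOKENIZE_PAT = re.compile(
    r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""
)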

Next

Step 03 takes this tokenizer to TinyStories: download the dataset, train the BPE on it, encode the whole thing once, and save it as a numpy array of token IDs ready for the training loop in step 09. We’ll also build the (input, target) batching logic that turns a flat array of IDs into the pairs the model learns from.