Tokenization

A language model doesn’t see characters or words — it sees integer tokens. The tokenizer is the layer that decides how text becomes those integers, and it has profound consequences for what the model can do.

The tradeoff

The key design choice is the granularity of the tokenizer:

  • Character-level. Pros: tiny vocabulary; no out-of-vocab. Cons: long sequences; hard to learn meaning.
  • Word-level. Pros: short sequences; words are meaningful units. Cons: huge vocab; can’t handle unseen words.
  • Subword (BPE/SentencePiece). Pros: a compromise where common words are one token and rare words break apart. Cons: slightly arbitrary boundaries.

Modern LLMs all use subword tokenization.

Byte Pair Encoding (BPE)

Originally a compression algorithm (Gage 1994), repurposed for NLP by Sennrich et al. (2015).

Train by:

  1. Start with a vocabulary of all individual characters/bytes.
  2. Find the most frequent pair of adjacent tokens in the corpus.
  3. Merge them into a new token; add to vocabulary.
  4. Repeat until you hit your target vocabulary size (e.g. 50k–200k).

Encode by greedily applying the learned merges to a new string.

"unbelievable" might tokenize to: ["un", "believ", "able"]
"happy" might be one token: ["happy"]
"happiness" might be: ["happi", "ness"]

Used by GPT-2, GPT-3, GPT-4 (with byte-level BPE), Claude, LLaMA.

Byte-level BPE

GPT-2 introduced byte-level BPE — operating on UTF-8 bytes rather than Unicode code points. Benefits:

  • Vocabulary covers all of Unicode without a special UNK token.
  • Handles arbitrary text including emoji, symbols, mixed languages.
  • 256 base tokens (one per byte) before any merges.

This is the standard for OpenAI’s tiktoken library and most modern LLMs.
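
To see the byte-level base alphabet at work (a sketch; the exact merged tokens depend on the model’s tokenizer):

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
text = "naïve 👍"
print(list(text.encode("utf-8")))   # raw UTF-8 bytes: the 256-symbol base alphabet
print(enc.encode(text))             # after merges: usually fewer tokens than raw bytes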

SentencePiece

Google’s library that implements BPE and Unigram tokenization. Key features:

  • Treats the input as a raw stream of Unicode characters (whitespace included) — no language-specific pre-tokenization required.
  • Reversible — encode then decode gives back the original.
  • Uses a sentinel character (▁, U+2581) to mark word boundaries explicitly.

Used by T5, ALBERT, and many multilingual models.
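
A minimal training run looks roughly like this (toy corpus and tiny vocabulary purely for illustration):

import sentencepiece as spm

# Tiny throwaway corpus to keep the example self-contained; a real run
# trains on a large corpus representative of the target data.
with open("toy_corpus.txt", "w", encoding="utf-8") as f:
    f.write("the quick brown fox jumps over the lazy dog\n" * 200)

spm.SentencePieceTrainer.train(
    input="toy_corpus.txt", model_prefix="toy", vocab_size=45, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="toy.model")
print(sp.encode("the lazy fox", out_type=str))   # pieces carry the ▁ word-boundary sentinel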

WordPiece

Used by BERT. Similar to BPE but uses a different merge criterion (likelihood-based rather than frequency-based). Words are tokenized greedily, with a ## prefix indicating “this token continues a word.”

"unaffable" → ["un", "##aff", "##able"]

Unigram language model

Treat the tokenizer itself as a probabilistic language model over subwords. Training starts from a large candidate vocabulary and iteratively prunes the pieces that contribute least to the corpus likelihood under the unigram model; encoding then picks the most probable segmentation (or samples one, for regularization). Used by SentencePiece’s Unigram mode and by some recent models.
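
A rough sketch of the idea with made-up piece probabilities: the score of a segmentation is the product (here, the sum of logs) of its pieces’ unigram probabilities, and encoding keeps the highest-scoring segmentation.

import math

# Hypothetical unigram probabilities for a few pieces (illustrative numbers only).
probs = {"un": 0.02, "believ": 0.001, "able": 0.01, "unbeliev": 0.0002,
         "a": 0.06, "b": 0.03, "l": 0.04, "e": 0.05}

def score(pieces):
    # Log-probability of a segmentation under the unigram model.
    return sum(math.log(probs[p]) for p in pieces)

print(score(["un", "believ", "able"]))          # higher (less negative) wins
print(score(["unbeliev", "a", "b", "l", "e"]))  # scores worse despite fewer big pieces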

Special tokens

Every tokenizer has a few reserved tokens for purposes other than text:

  • [BOS], <s>, <|begin_of_text|>: beginning of sequence
  • [EOS], </s>, <|eot_id|>: end of sequence
  • [PAD], <pad>: padding to a fixed length
  • [UNK]: unknown token (rare in modern tokenizers)
  • [CLS], [SEP]: classification and separator tokens (BERT)
  • <|im_start|>, <|im_end|>: chat message boundaries
  • <|tool_call|>: function call delimiter
  • <|fim_prefix|>, <|fim_middle|>, <|fim_suffix|>: fill-in-the-middle (code models)

Chat templates are basically formats that use these special tokens to mark roles (system, user, assistant) and turn boundaries.
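
For instance, a ChatML-style template (the format behind <|im_start|> / <|im_end|>) is just string formatting around the messages; this is a sketch, and real templates differ per model:

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is a token?"},
]
prompt = ""
for m in messages:
    prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
prompt += "<|im_start|>assistant\n"   # open the assistant turn so the model completes it
print(prompt)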

Practical effects of tokenization

Token counting

API pricing and context windows are measured in tokens, not characters or words.

Rough rule of thumb for English:

  • 1 token ≈ 4 characters
  • 1 token ≈ 0.75 words
  • 100 tokens ≈ 75 words ≈ 1 paragraph

Other languages typically need more tokens for the same content: Chinese, Korean, and Japanese can take roughly 2–3× as many tokens as English to express the same meaning, which makes them proportionally more expensive on the API.

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
tokens = enc.encode("Hello, world!")
print(len(tokens))   # 4

Numbers and arithmetic

Tokenizers split numbers in surprising ways:

  • “12345” might be ["12", "345"] or ["1", "23", "45"].
  • “1,234” and “1234” have different token counts.

This is one reason LLMs are sometimes bad at arithmetic. Modern models partially work around this by tokenizing digits in a more regular way (LLaMA 2 gives every digit its own token; newer tokenizers such as LLaMA 3’s group digits into short chunks) or by offloading arithmetic to tools.
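
You can probe how a specific tokenizer splits numbers rather than assuming a grouping:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
for s in ["12345", "1,234", "1234"]:
    ids = enc.encode(s)
    print(s, [enc.decode([i]) for i in ids])   # show the digit chunks this tokenizer uses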

Whitespace

Most tokenizers attach leading whitespace to the following word: " the" is a different token than "the". This affects prompts — “Continue this:\nThe” vs “Continue this:\n The” produce slightly different tokenizations.
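
A quick check of the leading-space behavior with tiktoken:

import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")
print(enc.encode("the"))    # token ids without a leading space
print(enc.encode(" the"))   # different ids: the space is part of the token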

Code

Code-tuned tokenizers handle indentation, common keywords, and operators efficiently. A general-purpose tokenizer often wastes tokens on whitespace runs and bracket pairs.

Inspecting tokenization

Inspect tokenization programmatically:

import tiktoken
enc = tiktoken.encoding_for_model("gpt-4o")
text = "Hello, world! 你好。"
tokens = enc.encode(text)
for t in tokens:
    print(t, repr(enc.decode([t])))

Or use web tools like OpenAI’s tokenizer page or tiktokenizer.vercel.app. Look at how your prompt actually tokenizes when debugging weird behavior.

Tokenization affects model behavior

Surprisingly often:

  • Glitch tokens: rare tokens (e.g. “ SolidGoldMagikarp” in GPT-2/3) trigger nonsense outputs.
  • Cross-language transfer: tokenizers trained mostly on English handle Italian better than Korean.
  • Reasoning quirks: counting the “r”s in “strawberry” famously fails because the model sees whole subword tokens, not individual letters.
  • Prompt sensitivity: tiny rephrases that change tokenization can change outputs.

Choosing or training a tokenizer

For most apps: use the model’s tokenizer. Don’t try to be clever.

If pretraining a model from scratch:

  • Pick a target vocabulary size (~32k–256k for modern LLMs).
  • Train on a corpus representative of your eventual use.
  • Include byte-level fallback for robustness.
  • Decide on whitespace handling early (SentencePiece-style explicit boundaries vs leading-space tokens).

For domain-specific models, sometimes a domain-specific tokenizer beats a general one — e.g. a code-only tokenizer encodes code more efficiently. Modern models often use a mixed-corpus tokenizer (text + code + multilingual + math) as a compromise.
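
If you do train your own, the Hugging Face tokenizers library covers the byte-level BPE case; the corpus path, vocabulary size, and special tokens below are placeholders:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Byte-level BPE: bytes form the base alphabet, so there is never an unknown token.
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=32_000,
    special_tokens=["<|endoftext|>"],   # reserve ids for special tokens up front
)
tokenizer.train(files=["corpus.txt"], trainer=trainer)   # corpus.txt is a placeholder path
tokenizer.save("my_tokenizer.json")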

Practical advice

  1. Always measure token counts when working with paid APIs — inefficient tokenization can inflate your bill by 5× or more.
  2. Use the model’s own tokenizer for accurate counts. Char/word approximations are fine for budgets, not for length-critical prompts.
  3. Inspect tokenization when debugging odd behavior.
  4. Prefer recent tokenizers for non-English text — older ones (GPT-3 era) are notably less efficient on multilingual text.

See also