demo

Where your prompt actually goes

Type something. Watch BPE break it apart. Switch encoders to see how the same text becomes a different number of tokens — and a different bill — depending on which model reads it.

What you can see in this demo

  • Whitespace is part of the token. Try GPT GPT4 GPT-4: the leading space changes everything. Tokens with a leading space render with a · marker so you can see the boundary (see the sketch after this list).
  • Vocabulary size changes the bill. Switch from GPT-2's 50k vocab to GPT-4o's 200k. The same text often costs 20-40% fewer tokens at GPT-4o because longer common phrases become single tokens.
  • Foreign scripts pay a tax. Try the Vietnamese example. The tokenizer's training distribution decides what gets a single token vs. broken into byte-level fragments. English averages 4-5 chars/token; Vietnamese can drop below 2.
  • Hover any token to see its raw bytes, codepoints, and id. Some "tokens" are invisible whitespace; some are multi-byte UTF-8 fragments that don't even render as a complete character on their own.
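
You can reproduce the first two bullets in a few lines. A minimal sketch, assuming gpt-tokenizer's per-encoding entry points (gpt-tokenizer/encoding/<name>); exact ids and counts depend on the library version:

```ts
// Compare how the same text tokenizes under two encodings.
// Assumes gpt-tokenizer's per-encoding subpath imports; the counts you
// get are version-dependent, so treat any numbers as illustrative.
import { encode as encodeR50k, decode as decodeR50k } from "gpt-tokenizer/encoding/r50k_base";
import { encode as encodeO200k } from "gpt-tokenizer/encoding/o200k_base";

const text = "GPT GPT4 GPT-4";

// Whitespace belongs to the token: " GPT4" and "GPT4" get different ids.
for (const id of encodeR50k(text)) {
  // Render a leading space as · so the boundary is visible, as the demo does.
  console.log(id, decodeR50k([id]).replace(/^ /, "·"));
}

// Vocabulary size changes the bill: same text, different token counts.
console.log("r50k tokens:", encodeR50k(text).length);
console.log("o200k tokens:", encodeO200k(text).length);
```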

Try this — predict before you click

  1. Type GPT-4 GPT4 GPT4o. Predict: the three terms tokenize into wildly different lengths even though they look similar. The hyphen vs no-hyphen and the trailing o decide which BPE merges fire.
  2. Switch encoder from r50k_base (GPT-2) to o200k_base (GPT-4o) on a paragraph of normal English. Predict: the o200k count drops 25–35%. The bigger vocab swallows more common phrases as single tokens.
  3. Try the Vietnamese example on r50k_base. Predict: many characters become 2–3 byte-level tokens because the GPT-2 vocab barely saw Vietnamese. Switch to o200k_base: the count drops dramatically because the larger vocabulary's training data included far more diacritic-heavy text.
  4. Type aaaaaaaaa (9 a's). Predict: it tokenizes as one or two tokens, not nine. BPE merges repetition aggressively. Now try asdfgh. Predict: most of those characters survive as separate tokens, since near-random sequences hit few merges. (The sketch after this list runs all four experiments.)
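
If you'd rather check than predict, the same experiments run as a short script. A sketch using the same gpt-tokenizer entry points as above; the Vietnamese sentence is just an illustrative translation, and real counts vary by library version:

```ts
// Run the four predictions above programmatically.
import { encode as r50k } from "gpt-tokenizer/encoding/r50k_base";
import { encode as o200k } from "gpt-tokenizer/encoding/o200k_base";

const count = (enc: (s: string) => number[], s: string) => enc(s).length;

// 1. Similar-looking strings, different merge paths.
for (const s of ["GPT-4", "GPT4", "GPT4o"]) {
  console.log(s, "->", count(o200k, s), "tokens");
}

// 2 & 3. Bigger vocab, fewer tokens; most dramatic on non-English text.
const english = "The quick brown fox jumps over the lazy dog.";
const vietnamese = "Con cáo nâu nhanh nhẹn nhảy qua con chó lười.";
for (const s of [english, vietnamese]) {
  console.log(`${count(r50k, s)} (r50k) vs ${count(o200k, s)} (o200k):`, s);
}

// 4. Repetition compresses; near-random sequences barely merge.
console.log("aaaaaaaaa:", count(r50k, "aaaaaaaaa"), "tokens");
console.log("asdfgh:", count(r50k, "asdfgh"), "tokens");
```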

Why this matters

Token count is the unit of cost, the unit of latency, and the unit of context. A 50-page PDF translated into Vietnamese might not fit in a context window where the English version does. Repetitive text compresses better than you expect; URLs and code compress worse. If you've ever wondered why your prompt is "more expensive than it should be" — this is usually why.
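
In practice the budget check is one call. A sketch of a token-budget helper, assuming gpt-tokenizer's isWithinTokenLimit export; the context limit and price below are placeholders, not real rates:

```ts
// Token count as budget: does this text fit, and roughly what does it cost?
// isWithinTokenLimit returns false when over the limit, else the token count.
import { encode, isWithinTokenLimit } from "gpt-tokenizer/encoding/o200k_base";

const CONTEXT_LIMIT = 128_000;        // assumed window size, not a quoted spec
const USD_PER_MILLION_TOKENS = 2.5;   // hypothetical input price

function budget(text: string) {
  const fits = isWithinTokenLimit(text, CONTEXT_LIMIT);
  const tokens = fits === false ? encode(text).length : fits;
  const estUsd = (tokens / 1_000_000) * USD_PER_MILLION_TOKENS;
  return { tokens, fits: fits !== false, estUsd };
}

console.log(budget("A 50-page PDF would go here."));
```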

How it works

All three encoders run client-side via gpt-tokenizer — pure JavaScript, no WASM, no model load. Encoding happens synchronously on every keystroke; for typical input it's well under a frame.
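
The wiring is as simple as it sounds. A hypothetical sketch (the element ids and rendering here are made up, not this demo's actual source):

```ts
// Per-keystroke encoding: synchronous, no debounce needed for typical input.
import { encode, decode } from "gpt-tokenizer/encoding/o200k_base";

const input = document.querySelector<HTMLTextAreaElement>("#prompt")!;
const output = document.querySelector<HTMLElement>("#tokens")!;

input.addEventListener("input", () => {
  const ids = encode(input.value);
  // One segment per token; leading spaces rendered as · to expose boundaries.
  output.textContent = ids
    .map((id) => decode([id]).replace(/^ /, "·"))
    .join(" | ");
});
```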

The encoders are r50k_base (GPT-2 / GPT-3), cl100k_base (GPT-3.5 / GPT-4), and o200k_base (GPT-4o). Claude and Llama use different tokenizers, but the pattern is identical: byte-level BPE with a learned merge table, applied greedily in rank order.
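
To make the merge mechanics concrete, here is a toy BPE encoder with a two-entry merge table. It illustrates the mechanism only; real tables hold tens of thousands of ranked merges, and real encoders work on bytes behind a regex pre-tokenizer:

```ts
// Toy BPE: start from single characters, repeatedly apply the learned merge
// with the lowest rank until no merge applies. The table below is made up.
const ranks = new Map<string, number>([
  ["a a", 0],   // pretend "aa" was the most frequent pair in training
  ["aa aa", 1], // then "aaaa"
]);

function bpe(text: string): string[] {
  let parts = [...text]; // char-level start (equivalent to bytes for ASCII)
  for (;;) {
    let best = -1;
    let bestRank = Infinity;
    // Find the adjacent pair with the lowest (highest-priority) rank.
    for (let i = 0; i < parts.length - 1; i++) {
      const r = ranks.get(parts[i] + " " + parts[i + 1]);
      if (r !== undefined && r < bestRank) { bestRank = r; best = i; }
    }
    if (best === -1) return parts; // no learned merge applies: done
    parts.splice(best, 2, parts[best] + parts[best + 1]);
  }
}

console.log(bpe("aaaaaaaaa")); // repetition hits merges and collapses fast
console.log(bpe("asdfgh"));    // no merges in this toy table: stays per-char
```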

Anchored to 05-tokens-embeddings/tokenization from the learning path.