Stage 05 · Tokens & Embeddings

Tokenization and embeddings are the data pipeline at the bottom of every LLM. RAG systems live or die on embedding quality. Understand vocabulary granularity, semantic geometry, and contextual reshaping — or be confused by everything above.

4 articles · 16 min to read · 3 demos · 3 books

If you only do one thing

Words have geometry. The arithmetic king − man + woman ≈ queen is not a parlor trick — it's why retrieval works at all.
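
You can check the arithmetic yourself. A minimal sketch using pretrained GloVe vectors via gensim's downloader (assumes gensim is installed and the model can be fetched over the network):

    import gensim.downloader as api

    # 100-dimensional GloVe vectors; downloads (~130 MB) on first use.
    wv = api.load("glove-wiki-gigaword-100")

    # most_similar does the arithmetic (king - man + woman) and returns
    # the nearest vocabulary vectors by cosine similarity.
    print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
    # 'queen' should appear at or near the top.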

Articles in this stage

  • 01 Contextual Embeddings
  • 02 Semantic Geometry
  • 03 Static Embeddings
  • 04 Tokenization

Stage 05 — Tokens & Embeddings

Before any language model does anything, raw text becomes integers (tokens) and integers become vectors (embeddings). This stage covers both — the data pipeline at the bottom of every LLM.

Prerequisites

  • Stage 03 (neural networks)
  • Stage 04 (neural language models)

Learning ladder

  1. Tokenization — text → integers (BPE, WordPiece, SentencePiece, byte-level); see the sketch after this list
  2. Static embeddings — Word2Vec, GloVe, FastText
  3. Contextual embeddings — ELMo, BERT, modern embedding models
  4. Semantic geometry — what makes embeddings useful
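
Rung 1 is easy to poke at directly. A minimal sketch using GPT-2's byte-level BPE tokenizer from the Hugging Face transformers library (the model choice is illustrative; exact splits differ across vocabularies):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")

    for text in ["tokenization", "tokenization-aware"]:
        print(text, "->", tok.tokenize(text), "->", tok.encode(text))

    # Frequent strings merge into few pieces; the rarer hyphenated form
    # splits into more subwords, because the learned merge table only
    # stores byte sequences that were common in the training corpus.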

MVU

You can:

  • Tokenize a sentence with BPE and explain why “tokenization” is one token while “tokenization-aware” is several.
  • Distinguish static from contextual embeddings.
  • Use cosine similarity to compare embeddings and explain why we don’t reach for Euclidean distance (see the sketch after this list).
  • Pick an embedding model for a given task (general-purpose, code, multilingual, multimodal).
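
A sketch covering the static-vs-contextual and cosine bullets at once, assuming transformers and torch are installed and bert-base-uncased downloads cleanly (the two example sentences are ours):

    import torch
    from transformers import AutoTokenizer, AutoModel

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bank_vector(sentence: str) -> torch.Tensor:
        """Contextual hidden state for the token 'bank' in this sentence."""
        inputs = tok(sentence, return_tensors="pt")
        with torch.no_grad():
            hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
        pos = inputs["input_ids"][0].tolist().index(tok.convert_tokens_to_ids("bank"))
        return hidden[pos]

    a = bank_vector("she sat on the river bank")
    b = bank_vector("he deposited cash at the bank")

    # A static embedding (Word2Vec, GloVe) would give 'bank' the same vector
    # in both sentences; here attention over each sentence reshapes it.
    # Cosine compares direction and ignores vector length, which often tracks
    # things like token frequency rather than meaning.
    cos = torch.nn.functional.cosine_similarity(a, b, dim=0)
    print(f"cosine(bank_river, bank_money) = {cos.item():.3f}")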

Exercise

Take 1000 product reviews. Embed them with sentence-transformers/all-MiniLM-L6-v2. For a query review, find the 5 nearest neighbors. Then swap in a model tuned for a different domain (a code-search model, say) and compare the neighbor lists. Notice how task fit affects retrieval quality. A starter sketch follows.
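
A starter sketch, assuming sentence-transformers and numpy are installed (the reviews and query below are placeholders for your own data):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

    reviews = ["great battery life", "arrived broken", "fits perfectly"]  # your 1000 reviews
    query = "battery died after two days"                                 # your query review

    # normalize_embeddings=True makes the dot product equal cosine similarity.
    doc_vecs = model.encode(reviews, normalize_embeddings=True)
    q_vec = model.encode(query, normalize_embeddings=True)

    scores = doc_vecs @ q_vec          # cosine similarity against every review
    for i in np.argsort(-scores)[:5]:  # indices of the 5 nearest neighbors
        print(f"{scores[i]:.3f}  {reviews[i]}")

Swapping models means changing only the model name; the difference between the two neighbor lists is the task-fit gap.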

Why this stage matters

Every transformer’s first operation is embed(token_id). The quality of that lookup — vocabulary granularity, embedding initialization, contextual reshaping — sets the ceiling for everything downstream. RAG systems live or die on embedding quality. Understand this layer or be confused by everything above it.
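
In code, that first operation is just a table lookup. A sketch in PyTorch (the sizes are GPT-2-ish but illustrative, not tied to a real checkpoint):

    import torch

    vocab_size, d_model = 50_257, 768                # illustrative sizes
    embed = torch.nn.Embedding(vocab_size, d_model)  # the learned lookup table

    token_ids = torch.tensor([[464, 2068, 7586]])    # arbitrary example ids
    x = embed(token_ids)                             # rows of the table, stacked
    print(x.shape)                                   # torch.Size([1, 3, 768])

    # Everything above this point (attention, MLPs, the LM head) only ever
    # sees these vectors, never the raw text.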

Further reading

Books move slower than papers in this field — treat these as foundations, not replacements for the latest research. Real authors, real publishers, real editions. Free badges mark books with author-authorized full text online.

  1. Natural Language Processing with Transformers

     Lewis Tunstall, Leandro von Werra, Thomas Wolf

     O'Reilly, Revised ed., 2023

     The HuggingFace book. Architecture-first treatment with working code.

  2. Speech and Language Processing (free)

     Daniel Jurafsky, James H. Martin

     Stanford, 3rd ed. draft, 2024

     The canonical NLP textbook. Draft chapters free online.