Stage 05 — Tokens & Embeddings

Before any language model does anything, raw text becomes integers (tokens) and integers become vectors (embeddings). This stage covers both — the data pipeline at the bottom of every LLM.

Prerequisites

  • Stage 03 (neural networks)
  • Stage 04 (neural language models)

Learning ladder

  1. Tokenization — text → integers (BPE, WordPiece, SentencePiece, byte-level); see the sketch after this list
  2. Static embeddings — Word2Vec, GloVe, FastText
  3. Contextual embeddings — ELMo, BERT, modern embedding models
  4. Semantic geometry — what makes embeddings useful
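
To make step 1 concrete, here is a minimal sketch of byte-level BPE tokenization using the Hugging Face `transformers` GPT-2 tokenizer (the choice of GPT-2 is an assumption for illustration; any BPE tokenizer shows the same behavior):

```python
# pip install transformers
from transformers import AutoTokenizer

# GPT-2 uses byte-level BPE; frequent strings get merged into single tokens,
# rarer compounds are split into several subword pieces.
tok = AutoTokenizer.from_pretrained("gpt2")

for text in ["tokenization", "tokenization-aware", "the cat sat on the mat"]:
    pieces = tok.tokenize(text)   # subword strings
    ids = tok.encode(text)        # integer token ids
    print(f"{text!r:26} -> {len(ids)} tokens: {pieces}")
```

Running this shows how the same surface string can cost very different numbers of tokens depending on how often its pieces appeared in the tokenizer's training data.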

MVU

You can:

  • Tokenize a sentence with BPE and explain why a frequent word often maps to a single token while a rarer string like “tokenization-aware” splits into several pieces.
  • Distinguish static from contextual embeddings.
  • Use cosine similarity to measure embedding similarity and explain why we don’t use Euclidean distance (see the sketch after this list).
  • Pick an embedding model for a given task (general-purpose, code, multilingual, multimodal).
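
A minimal NumPy sketch of the cosine-vs-Euclidean point, with made-up toy vectors: cosine similarity depends only on direction, so two embeddings that point the same way score as identical even when their magnitudes differ, while Euclidean distance penalizes the scale gap.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between a and b; vector magnitude cancels out."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Two toy "embeddings" pointing in the same direction, different magnitudes.
a = np.array([1.0, 2.0, 3.0])
b = 10.0 * a

print(cosine_similarity(a, b))   # 1.0: same direction, treated as the same meaning
print(np.linalg.norm(a - b))     # large: Euclidean distance punishes the scale difference
```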

Exercise

Take 1000 product reviews. Embed them with sentence-transformers/all-MiniLM-L6-v2. For a query review, find the 5 nearest neighbors. Then swap in a model tuned for a different domain (e.g. a code-tuned one) and notice how task fit affects retrieval quality.
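
A starting-point sketch using the `sentence-transformers` library; the `reviews` list here is a tiny placeholder standing in for your 1000-review corpus:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Placeholder corpus; in the exercise this would be your 1000 product reviews.
reviews = [
    "Battery life is great, lasts two days on a single charge.",
    "Stopped working after a week, very disappointed.",
    "Shipping was fast and the packaging was solid.",
    "The battery drains overnight even when idle.",
]
query = "How long does the battery last?"

corpus_emb = model.encode(reviews, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and every review, then take the top matches.
scores = util.cos_sim(query_emb, corpus_emb)[0]
top_k = min(5, len(reviews))
for idx in scores.argsort(descending=True)[:top_k]:
    print(f"{scores[idx]:.3f}  {reviews[idx]}")
```

To compare models, rerun the same loop with a different model name and check whether the neighbors still make sense for review-style queries.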

Why this stage matters

Every transformer’s first operation is embed(token_id). The quality of that lookup — vocabulary granularity, embedding initialization, contextual reshaping — sets the ceiling for everything downstream. RAG systems live or die on embedding quality. Understand this layer or be confused by everything above it.
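
That first operation is literally a row lookup into a learned matrix. A minimal PyTorch sketch (the vocabulary size, embedding dimension, and token ids below are arbitrary illustration values, not any particular model's):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768          # arbitrary sizes for illustration
embed = nn.Embedding(vocab_size, d_model)  # one learned vector per token id

token_ids = torch.tensor([[15496, 995]])   # a (batch, seq_len) tensor of token ids
vectors = embed(token_ids)                 # row lookup into the embedding matrix
print(vectors.shape)                       # torch.Size([1, 2, 768])
```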

See also