Stage 05 — Tokens & Embeddings
Before any language model does anything, raw text becomes integers (tokens) and integers become vectors (embeddings). This stage covers both — the data pipeline at the bottom of every LLM.
Prerequisites
- Stage 03 (neural networks)
- Stage 04 (neural language models)
Learning ladder
- Tokenization — text → integers (BPE, WordPiece, SentencePiece, byte-level); see the sketch after this list
- Static embeddings — Word2Vec, GloVe, FastText
- Contextual embeddings — ELMo, BERT, modern embedding models
- Semantic geometry — what makes embeddings useful
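To make the first rung concrete, here is a minimal sketch of text → integers with an off-the-shelf BPE tokenizer. The gpt2 checkpoint is just an example vocabulary; any BPE tokenizer shows the same mapping.

```python
# Minimal sketch: text -> token ids with a BPE tokenizer.
# "gpt2" is only an example vocabulary (an assumption, not a requirement).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization-aware models start from integers."
ids = tok.encode(text)                   # text -> list of ints
pieces = tok.convert_ids_to_tokens(ids)  # the subword behind each id

for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6}  {piece!r}")

print(tok.decode(ids))                   # round-trips to the original string
```

Notice how frequent words map to single ids while rare compounds split into several subword pieces.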
MVU
You can:
- Tokenize a sentence with BPE and explain why a frequent word comes out as a single token while a rare compound like “tokenization-aware” splits into several subwords.
- Distinguish static from contextual embeddings.
- Use cosine similarity to compare embeddings and explain why it is usually preferred over raw Euclidean distance: cosine ignores vector magnitude, which often tracks frequency or length rather than meaning (see the sketch after this list).
- Pick an embedding model for a given task (general-purpose, code, multilingual, multimodal).
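A small NumPy sketch of the cosine comparison mentioned above; the vectors are made-up toy values, not real embeddings.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Only the angle matters; vector length is divided out.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.9, 0.1])
b = 2 * a                       # same direction, twice the magnitude
c = np.array([0.9, -0.1, 0.3])

print(cosine_sim(a, b))         # 1.0: same direction despite different norms
print(cosine_sim(a, c))         # much lower: different direction
print(np.linalg.norm(a - b))    # Euclidean distance still sees a large gap between a and b
```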
Exercise
Take 1000 product reviews. Embed them with sentence-transformers/all-MiniLM-L6-v2. For a query review, find the 5 nearest neighbors. Then repeat with a model tuned for a different domain (e.g. a code-embedding model) and notice how task fit affects retrieval quality. A starter sketch follows.
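A starter sketch for the exercise, assuming you have already loaded your reviews as a list of strings; the six placeholder reviews and the query index below are stand-ins.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Replace the placeholders with your 1000 product reviews.
reviews = [
    "Great battery life, mediocre screen.",
    "The screen is gorgeous but the battery dies fast.",
    "Shipping was slow, product exactly as described.",
    "Stopped working after two weeks, very disappointed.",
    "Excellent value for the price, would buy again.",
    "Customer support never answered my emails.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(reviews, normalize_embeddings=True)  # (N, 384) array, unit-norm rows

query_idx = 0
scores = emb @ emb[query_idx]          # dot product of unit vectors == cosine similarity
nearest = np.argsort(-scores)[1:6]     # top 5 neighbors, skipping the query itself

for i in nearest:
    print(f"{scores[i]:.3f}  {reviews[i]}")
```

Swapping the model string for a differently tuned checkpoint and re-running the same loop is enough to see the retrieval-quality shift the exercise asks about.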
Why this stage matters
Every transformer’s first operation is embed(token_id). The quality of that lookup — vocabulary granularity, embedding initialization, contextual reshaping — sets the ceiling for everything downstream. RAG systems live or die on embedding quality. Understand this layer or be confused by everything above it.
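For intuition, this is all that embed(token_id) is: a learned lookup table. A PyTorch sketch with toy sizes and arbitrary ids:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768             # toy values; real models vary
embed = nn.Embedding(vocab_size, d_model)     # one learnable row per token id

token_ids = torch.tensor([[15496, 995, 11]])  # arbitrary ids for illustration
x = embed(token_ids)                          # (batch=1, seq=3, d_model=768)
print(x.shape)
```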