Stage 05 — Tokens & Embeddings
Before any language model does anything, raw text becomes integers (tokens) and integers become vectors (embeddings). This stage covers both — the data pipeline at the bottom of every LLM.
Prerequisites
- Stage 03 (neural networks)
- Stage 04 (neural language models)
Learning ladder
- Tokenization — text → integers (BPE, WordPiece, SentencePiece, byte-level); see the sketch after this list
- Static embeddings — Word2Vec, GloVe, FastText
- Contextual embeddings — ELMo, BERT, modern embedding models
- Semantic geometry — what makes embeddings useful
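To make the first rung concrete, here is a minimal sketch of text → integers with an off-the-shelf BPE tokenizer. The gpt2 checkpoint is just an example vocabulary; any BPE tokenizer shows the same mapping.

```python
# Minimal sketch: text -> token ids with a BPE tokenizer.
# "gpt2" is only an example vocabulary (an assumption, not a requirement).
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

text = "Tokenization-aware models start from integers."
ids = tok.encode(text)                   # text -> list of ints
pieces = tok.convert_ids_to_tokens(ids)  # the subword behind each id

for token_id, piece in zip(ids, pieces):
    print(f"{token_id:>6}  {piece!r}")

print(tok.decode(ids))                   # round-trips to the original string
```

Notice how frequent words map to single ids while rare compounds split into several subword pieces.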
MVU
You can:
- Tokenize a sentence with BPE and explain why a frequent word comes out as a single token while a rare compound like “tokenization-aware” splits into several subwords.
- Distinguish static from contextual embeddings.
- Use cosine similarity to compare embeddings and explain why it is usually preferred over raw Euclidean distance: cosine ignores vector magnitude, which often tracks frequency or length rather than meaning (see the sketch after this list).
- Pick an embedding model for a given task (general-purpose, code, multilingual, multimodal).
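A small NumPy sketch of the cosine comparison mentioned above; the vectors are made-up toy values, not real embeddings.

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    # Only the angle matters; vector length is divided out.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([0.2, 0.9, 0.1])
b = 2 * a                       # same direction, twice the magnitude
c = np.array([0.9, -0.1, 0.3])

print(cosine_sim(a, b))         # 1.0: same direction despite different norms
print(cosine_sim(a, c))         # much lower: different direction
print(np.linalg.norm(a - b))    # Euclidean distance still sees a large gap between a and b
```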
Exercise
Take 1000 product reviews. Embed them with sentence-transformers/all-MiniLM-L6-v2. For a query review, find the 5 nearest neighbors. Then repeat with a model tuned for a different domain (e.g. a code-embedding model) and notice how task fit affects retrieval quality. A starter sketch follows.
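A starter sketch for the exercise, assuming you have already loaded your reviews as a list of strings; the six placeholder reviews and the query index below are stand-ins.

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Replace the placeholders with your 1000 product reviews.
reviews = [
    "Great battery life, mediocre screen.",
    "The screen is gorgeous but the battery dies fast.",
    "Shipping was slow, product exactly as described.",
    "Stopped working after two weeks, very disappointed.",
    "Excellent value for the price, would buy again.",
    "Customer support never answered my emails.",
]

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = model.encode(reviews, normalize_embeddings=True)  # (N, 384) array, unit-norm rows

query_idx = 0
scores = emb @ emb[query_idx]          # dot product of unit vectors == cosine similarity
nearest = np.argsort(-scores)[1:6]     # top 5 neighbors, skipping the query itself

for i in nearest:
    print(f"{scores[i]:.3f}  {reviews[i]}")
```

Swapping the model string for a differently tuned checkpoint and re-running the same loop is enough to see the retrieval-quality shift the exercise asks about.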
Why this stage matters
Every transformer’s first operation is embed(token_id). The quality of that lookup — vocabulary granularity, embedding initialization, contextual reshaping — sets the ceiling for everything downstream. RAG systems live or die on embedding quality. Understand this layer or be confused by everything above it.
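For intuition, this is all that embed(token_id) is: a learned lookup table. A PyTorch sketch with toy sizes and arbitrary ids:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50_000, 768             # toy values; real models vary
embed = nn.Embedding(vocab_size, d_model)     # one learnable row per token id

token_ids = torch.tensor([[15496, 995, 11]])  # arbitrary ids for illustration
x = embed(token_ids)                          # (batch=1, seq=3, d_model=768)
print(x.shape)
```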