Static Embeddings
A static embedding is a fixed vector for each word in a vocabulary, regardless of context. These are the classic Word2Vec-era representations.
The shared embedding matrix
For a vocabulary of size V and embedding dim d:
E ∈ ℝ^(V × d)
A token w becomes the row E[w]. That’s the embedding lookup at the bottom of every transformer.
In static embedding models, this matrix is the only learned representation; the same word gets the same vector in every context.
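A minimal sketch of that lookup; the sizes, words, and random values here are purely illustrative, and in a trained model E is learned rather than random:

```python
import numpy as np

V, d = 50_000, 300                    # vocabulary size and embedding dimension (illustrative)
E = (0.01 * np.random.randn(V, d)).astype(np.float32)   # E ∈ R^(V x d); learned during training

vocab = {"the": 0, "quick": 1, "brown": 2, "fox": 3}     # toy word -> row-id mapping

def embed(word):
    """Static embedding lookup: the same word always maps to the same row of E."""
    return E[vocab[word]]

print(embed("fox").shape)             # (300,)
```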
Word2Vec (2013)
Two architectures from Mikolov et al.:
CBOW (Continuous Bag of Words)
Predict a center word from its context.
context (e.g. "the ___ jumped") → predict "fox"
Faster to train; because it averages the context, it produces smoother embeddings and does better on frequent words.
Skip-gram
Predict context words from a center word.
"fox" → predict "the", "quick", "brown", "jumped"
Slower but better for rare words.
Both are typically trained with negative sampling: instead of computing a full softmax over the vocabulary, sample a few negative words and push them apart from the true pair.
L = log σ(e_w · e_c) + Σ_neg log σ(−e_w · e_neg)
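A rough NumPy sketch of that per-pair objective, with randomly initialized vectors standing in for learned ones; a real trainer also backpropagates through this and draws negatives from a smoothed unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

e_w   = rng.normal(scale=0.1, size=d)        # center word ("fox")
e_c   = rng.normal(scale=0.1, size=d)        # observed context word ("jumped")
e_neg = rng.normal(scale=0.1, size=(5, d))   # a handful of sampled negative context words

# Negative-sampling objective for one (center, context) pair, written as a loss to
# minimize (the negation of the log-likelihood L above).
loss = -(np.log(sigmoid(e_w @ e_c)) + np.log(sigmoid(-(e_neg @ e_w))).sum())
print(loss)
```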
What Word2Vec captures
The famous analogy:
king − man + woman ≈ queen
Paris − France + Italy ≈ Rome
This works because the geometry of the embedding space encodes relations as roughly linear offsets: consistent directions correspond to attributes such as gender, country, or tense.
But Word2Vec also captures noise and bias from training data. The same trick that gives “king − man + woman ≈ queen” can give “doctor − man + woman ≈ nurse.” Embedding bias is real and worth checking.
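The analogy test itself is just vector arithmetic plus a cosine nearest-neighbor search. The sketch below uses a toy random matrix, so the answer is only meaningful with real pretrained vectors:

```python
import numpy as np

# Toy stand-in for a pretrained embedding matrix; with real Word2Vec/GloVe vectors the
# nearest neighbor of king - man + woman is usually queen.
vocab = ["king", "man", "woman", "queen", "apple"]
idx = {w: i for i, w in enumerate(vocab)}
rng = np.random.default_rng(0)
E = rng.normal(size=(len(vocab), 50))

def analogy(a, b, c):
    """Word whose vector is closest (cosine) to E[a] - E[b] + E[c], excluding the query words."""
    target = E[idx[a]] - E[idx[b]] + E[idx[c]]
    target = target / np.linalg.norm(target)
    E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
    sims = E_norm @ target
    for w in (a, b, c):                      # standard trick: never return a query word
        sims[idx[w]] = -np.inf
    return vocab[int(np.argmax(sims))]

print(analogy("king", "man", "woman"))       # "queen" with real vectors (random here)
```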
GloVe (2014)
Global Vectors for Word Representation. Trains by factorizing a word-word co-occurrence matrix:
e_i · e_j + b_i + b_j ≈ log X_ij, where X_ij is the co-occurrence count of words i and j (the bias terms absorb word frequencies).
A different mathematical lineage (matrix factorization vs. predictive modeling), comparable quality. Sometimes preferred for being deterministic and interpretable; rarely used for new projects.
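A sketch of the weighted least-squares objective GloVe minimizes, with random placeholder vectors and counts; the bias terms and the weighting function f follow the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 500, 50
W     = rng.normal(scale=0.1, size=(V, d))    # word vectors
W_ctx = rng.normal(scale=0.1, size=(V, d))    # context vectors
b, b_ctx = np.zeros(V), np.zeros(V)           # bias terms
X = rng.poisson(1.0, size=(V, V)) + 1e-12     # placeholder co-occurrence counts

def f(x, x_max=100.0, alpha=0.75):
    """GloVe weighting: down-weight rare pairs, cap the weight for very frequent ones."""
    return np.minimum((x / x_max) ** alpha, 1.0)

# J = sum_ij f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2
# (Real GloVe sums only over observed, nonzero counts; the epsilon keeps this sketch simple.)
err = W @ W_ctx.T + b[:, None] + b_ctx[None, :] - np.log(X)
J = (f(X) * err ** 2).sum()
print(J)
```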
FastText (2016)
Word2Vec with character n-grams. Each word is represented as the sum of its character n-gram embeddings:
e_word = Σ_{n-gram in word} e_{n-gram}
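A sketch of that decomposition; the bucket hashing and 3-to-6-character n-grams with boundary markers follow the FastText paper, but the table here is random rather than learned:

```python
import numpy as np

rng = np.random.default_rng(0)
n_buckets, d = 100_000, 50
G = rng.normal(scale=0.1, size=(n_buckets, d))   # n-gram table; learned in real FastText

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams with boundary markers, as in the FastText paper."""
    w = f"<{word}>"
    return [w[i:i + n] for n in range(n_min, n_max + 1) for i in range(len(w) - n + 1)]

def embed(word):
    """Word vector = sum of its character n-gram vectors (n-grams hashed into buckets).
    Real FastText also adds a dedicated vector for in-vocabulary words."""
    return sum(G[hash(g) % n_buckets] for g in char_ngrams(word))

print(char_ngrams("the"))        # ['<th', 'the', 'he>', '<the', 'the>', '<the>']
print(embed("teh").shape)        # (50,) -- even a misspelling gets a vector
```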
Benefits:
- Out-of-vocabulary handling: novel words get sensible embeddings via their n-grams.
- Morphologically rich languages (Finnish, Turkish): fewer “unseen” words.
- Spelling errors: “teh” gets an embedding close to “the.”
Limits of static embeddings
A single vector per word can’t capture:
- Polysemy: “bank” (river vs financial) gets one vector — the average of its meanings.
- Sense disambiguation: “He drove to the bank to fish” should activate the river meaning; static embeddings can’t.
- Syntax-dependent meaning: “made of sand” vs “made the cut” use “made” very differently.
Hence the next leap: contextual embeddings (next article).
Where static embeddings still appear
- Retrieval at scale: when you need extremely cheap representations for billions of items, static word or sentence embeddings are still a practical option.
- Bag-of-words baselines: averaging word embeddings is a reasonable sentence representation if you don’t have anything better (see the sketch after this list).
- Sub-word embeddings: in modern transformers, the input embedding lookup (before any context) is technically still a static embedding. Context is added by the layers above.
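To make the bag-of-words baseline above concrete, a sketch with random placeholder vectors; any static table (Word2Vec, GloVe, FastText) would slot in:

```python
import numpy as np

# Random vectors stand in for a real static embedding table.
rng = np.random.default_rng(0)
word_vectors = {w: rng.normal(size=100) for w in "the quick brown fox jumped".split()}

def sentence_embedding(sentence, dim=100):
    """Bag-of-words baseline: mean of the word vectors, skipping unknown words."""
    vecs = [word_vectors[w] for w in sentence.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

print(sentence_embedding("The quick brown fox").shape)   # (100,)
```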
In modern AI, “embedding model” usually means a contextual one (sentence-transformers, OpenAI text-embedding-3, Voyage, Cohere). Static embeddings rarely get trained from scratch for new applications.
Connection to LLMs
The bottom layer of any transformer is an embedding lookup E[token]. Conceptually, this is a static embedding. But because the layers above contextualize it, the effective representation of each token at any layer above the first is contextual.
Some researchers analyze a model by inspecting these per-layer representations — early layers stay close to static; mid/late layers are highly contextual. This is the lens behind probing and mechanistic interpretability.
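A sketch of how to pull those per-layer representations out with the Hugging Face transformers library; the checkpoint name and sentence are placeholders:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"                    # placeholder; any encoder checkpoint works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

inputs = tok("He drove to the bank to fish", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# out.hidden_states is a tuple: embedding-layer output, then one tensor per transformer layer.
# The first entry is (nearly) static -- the token lookup plus position/segment embeddings --
# while later entries are increasingly contextual.
for i, h in enumerate(out.hidden_states):
    print(i, h.shape)                          # each is (1, seq_len, hidden_dim)
```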
Practical advice
For most modern apps:
- Don’t train Word2Vec/GloVe/FastText from scratch.
- Use a sentence-transformer or commercial embedding model for retrieval (see the sketch after this list).
- Use a multimodal model (CLIP) if you have images and text.
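As a sketch of the retrieval recommendation, using the sentence-transformers library; the model name is a common small default, not a specific endorsement:

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is a widely used small model; swap in whatever fits your latency/quality budget.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = ["Static embeddings assign one vector per word.",
        "Contextual models embed whole sentences."]
embeddings = model.encode(docs, normalize_embeddings=True)    # shape (2, 384) for this model

# On normalized vectors, cosine similarity is just a dot product.
print(embeddings @ embeddings.T)
```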
The exception is if you’re building something genuinely massive-scale and a sentence-transformer is too slow. Then static embeddings might come back as a layer in a hybrid system.