Static Embeddings

A static embedding is a fixed vector for each word in a vocabulary, regardless of context. These are the classic Word2Vec-era representations.

The shared embedding matrix

For a vocabulary of size V and embedding dim d:

E ∈ ℝ^(V × d)

A token w becomes the row E[w]. That’s the embedding lookup at the bottom of every transformer.

In static embedding models, this matrix is the only learned representation; the same word gets the same vector in every context.
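The lookup itself is just row indexing. A minimal sketch with NumPy (the matrix is random here; in a trained model it is learned):

```python
import numpy as np

V, d = 10_000, 300            # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))   # E ∈ ℝ^(V × d); random stand-in for a learned matrix

token_id = 42                 # index of some word w in the vocabulary
vector = E[token_id]          # the embedding lookup: one row of E
print(vector.shape)           # (300,)
```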

Word2Vec (2013)

Two architectures from Mikolov et al.:

CBOW (Continuous Bag of Words)

Predict a center word from its context.

context (e.g. "the ___ jumped") → predict "fox"

Faster to train; smoother embeddings.

Skip-gram

Predict context words from a center word.

"fox" → predict "the", "quick", "brown", "jumped"

Slower but better for rare words.

Both train with negative sampling: instead of a full softmax over the vocabulary, sample a few negative words and contrast them against the true pair, maximizing:

L = log σ(e_w · e_c) + Σ_neg log σ(−e_w · e_neg)
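A sketch of this objective for one (word, context) pair, with hand-rolled sigmoid and hypothetical names (real implementations vectorize this over batches):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_objective(e_w, e_c, e_negs):
    """Skip-gram negative-sampling objective for one (word, context) pair.

    e_w:    center-word vector, shape (d,)
    e_c:    true context-word vector, shape (d,)
    e_negs: k sampled negative vectors, shape (k, d)
    Higher is better; training maximizes this (or minimizes its negation).
    """
    pos = np.log(sigmoid(e_w @ e_c))              # pull the true pair together
    neg = np.sum(np.log(sigmoid(-(e_negs @ e_w))))  # push negatives apart
    return pos + neg

rng = np.random.default_rng(0)
d, k = 100, 5
print(sgns_objective(rng.normal(size=d), rng.normal(size=d),
                     rng.normal(size=(k, d))))
```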

What Word2Vec captures

The famous analogy:

king − man + woman ≈ queen
Paris − France + Italy ≈ Rome

This works because the geometry of the embedding space reflects compositional semantics — directions roughly correspond to attributes (gender, country, tense).
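The analogy query is a nearest-neighbor search over the offset vector. A sketch using cosine similarity; the tiny hand-built vocabulary below is purely illustrative (a real run needs trained embeddings):

```python
import numpy as np

def analogy(vectors, a, b, c):
    """Word closest to vectors[b] - vectors[a] + vectors[c] by cosine
    similarity, excluding a, b, c themselves. Answers 'a is to b as c is to ?'."""
    target = vectors[b] - vectors[a] + vectors[c]
    best, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):
            continue
        sim = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Hand-built 2-d toy space: axis 0 ≈ "royalty", axis 1 ≈ "gender".
toy = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
    "apple": np.array([-1.0, 0.0]),
}
print(analogy(toy, "man", "king", "woman"))  # → queen
```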

But Word2Vec also captures noise and bias from training data. The same trick that gives “king − man + woman ≈ queen” can give “doctor − man + woman ≈ nurse.” Embedding bias is real and worth checking.

GloVe (2014)

Global Vectors for Word Representation. Trains by factorizing a word-word co-occurrence matrix:

e_i · e_j + b_i + b_j ≈ log X_ij

where X_ij counts how often words i and j co-occur in a context window, and b_i, b_j are learned per-word bias terms.

A different mathematical lineage (global matrix factorization vs. local predictive modeling), comparable quality. Sometimes preferred for its interpretable, count-based objective; rarely used for new projects.
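The training loss is a weighted least-squares fit to the log co-occurrence counts. A per-pair sketch, with the weighting function f(X_ij) from the GloVe paper (variable names are mine):

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): down-weights rare pairs, caps frequent ones at 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """f(X_ij) * (w_i·w_j + b_i + b_j − log X_ij)^2 for one word pair."""
    err = w_i @ w_j + b_i + b_j - np.log(x_ij)
    return glove_weight(x_ij) * err ** 2
```

Training sums this over all nonzero cells of the co-occurrence matrix and optimizes the vectors and biases by gradient descent.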

FastText (2016)

Word2Vec with character n-grams. Each word is represented as the sum of its character n-gram embeddings:

e_word = Σ_{n-gram in word} e_{n-gram}

Benefits:

  • Out-of-vocabulary handling: novel words get sensible embeddings via their n-grams.
  • Morphologically rich languages (Finnish, Turkish): fewer “unseen” words.
  • Spelling errors: “teh” gets an embedding close to “the.”
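The n-gram decomposition above can be sketched directly. FastText wraps each word in '<' and '>' boundary markers; the hashing of n-grams into a fixed-size table is omitted here, and the names are hypothetical:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', plus the whole (marked) word itself."""
    w = f"<{word}>"
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]

def fasttext_vector(word, gram_emb, d=100):
    """Sum the embeddings of the word's n-grams. Unseen grams get zero here;
    real FastText hashes every gram into a fixed-size table instead."""
    vec = np.zeros(d)
    for g in char_ngrams(word):
        vec += gram_emb.get(g, np.zeros(d))
    return vec

print(char_ngrams("where", 3, 4))
# → ['<wh', 'whe', 'her', 'ere', 're>', '<whe', 'wher', 'here', 'ere>', '<where>']
```

This is why "teh" lands near "the": the two share most of their character n-grams, so their summed vectors are close.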

Limits of static embeddings

A single vector per word can’t capture:

  • Polysemy: “bank” (river vs financial) gets one vector — the average of its meanings.
  • Sense disambiguation: “He drove to the bank to fish” should activate the river meaning; static embeddings can’t.
  • Syntax-dependent meaning: “made of sand” vs “made the cut” use “made” very differently.

Hence the next leap: contextual embeddings (next article).

Where static embeddings still appear

  • Retrieval at scale: when you need extremely cheap representations for billions of items, static word/sentence embeddings are still worth considering.
  • Bag-of-words baselines: averaging word embeddings is a reasonable sentence representation if you don’t have anything better.
  • Sub-word embeddings: in modern transformers, the input embedding lookup (before any context) is technically still a static embedding. Context is added by the layers above.
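The bag-of-words baseline is just mean pooling over word vectors. A sketch (tokenization is a naive lowercase split; names hypothetical):

```python
import numpy as np

def sentence_embedding(sentence, word_vectors, d=300):
    """Average the static vectors of the words in the sentence.
    Words missing from the vocabulary are skipped; an all-unknown
    sentence falls back to the zero vector."""
    vecs = [word_vectors[w] for w in sentence.lower().split()
            if w in word_vectors]
    if not vecs:
        return np.zeros(d)
    return np.mean(vecs, axis=0)
```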

In modern AI, “embedding model” usually means a contextual one (sentence-transformers, OpenAI text-embedding-3, Voyage, Cohere). Static embeddings rarely get trained from scratch for new applications.

Connection to LLMs

The bottom layer of any transformer is an embedding lookup E[token]. Conceptually, this is a static embedding. But because the layers above contextualize it, the effective representation of each token at any layer above the first is contextual.

Some researchers analyze a model by inspecting these per-layer representations — early layers stay close to static; mid/late layers are highly contextual. This is the lens behind probing and mechanistic interpretability.

Practical advice

For most modern apps:

  • Don’t train Word2Vec/GloVe/FastText from scratch.
  • Use a sentence-transformer or commercial embedding model for retrieval.
  • Use a multimodal model (CLIP) if you have images and text.

The exception is if you’re building something genuinely massive-scale and a sentence-transformer is too slow. Then static embeddings might come back as a layer in a hybrid system.

See also