Contextual Embeddings

A contextual embedding is a vector for a word in a particular context. The same word in different sentences gets different vectors. This is what modern embedding models produce.

The shift

Static embeddings: one vector per word.

Contextual embeddings: one vector per word occurrence, computed by a deep model that looks at the surrounding text.

"He sat on the bank of the river"  →  embedding of "bank" leans toward {river, shore, water}
"He deposited cash at the bank"     →  embedding of "bank" leans toward {money, finance, account}

This is the difference between a 2010s NLP model and a 2018+ NLP model.
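
To make this concrete, here is a minimal sketch that pulls out the two vectors for "bank" from a BERT-style model (assumes the transformers library and the bert-base-uncased checkpoint, which is introduced below):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Returns the contextual vector for this occurrence of "bank".
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat on the bank of the river")
v_money = bank_vector("He deposited cash at the bank")
print(torch.cosine_similarity(v_river, v_money, dim=0))        # well below 1.0: same word, different vectors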

ELMo (2018)

Embeddings from Language Models. The first widely-used contextual embeddings.

  • Train a bidirectional LSTM language model.
  • Represent each token by a learnable combination of the LSTM’s per-layer hidden states.
  • Use these as features in downstream models.

ELMo demonstrated that contextual representations beat static ones across a wide range of NLP benchmarks, but it was quickly eclipsed when BERT arrived later the same year.

BERT (2018)

Bidirectional Encoder Representations from Transformers. The big leap.

Architecture: a transformer encoder (bidirectional self-attention).

Training objective: masked language modeling.

  1. Take a sentence.
  2. Randomly mask 15% of tokens.
  3. Train the model to predict the masked tokens given the rest.
"The quick brown [MASK] jumps over the lazy dog"
              ↑ predict "fox"
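
One quick way to poke at this objective is the fill-mask pipeline (assumes the transformers library; this downloads bert-base-uncased):

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill("The quick brown [MASK] jumps over the lazy dog."):
    print(prediction["token_str"], round(prediction["score"], 3))   # plausible fillers; "fox" should rank highly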

A second auxiliary objective in the original paper was next sentence prediction (later shown to add little and dropped in RoBERTa).

What BERT gives you

For each token, a 768-dim (base) or 1024-dim (large) vector that depends on the entire sentence. These can be:

  • Used directly as features.
  • Fine-tuned on a downstream task (classification, QA, NER) with a small head added.

For classification, the convention is to use the [CLS] token’s embedding as a sentence representation.
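
A minimal sketch of the fine-tuning route, assuming the transformers library and a hypothetical two-class task (a real run still needs a dataset and a training loop):

from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# The randomly initialized classification head sits on top of the [CLS] position's representation.
inputs = tokenizer("He deposited cash at the bank", return_tensors="pt")
logits = model(**inputs).logits          # shape (1, 2); train with cross-entropy against your labels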

Variants

  • RoBERTa: BERT with better training (no NSP, more data, longer training, different masking schedule).
  • DistilBERT: knowledge-distilled smaller version.
  • ALBERT: parameter-shared variant; smaller but slower per-step.
  • DeBERTa: improved with disentangled attention; a top GLUE/SuperGLUE scorer for years.
  • ELECTRA: trained to discriminate real vs. replaced tokens — more compute-efficient.

For 2018–2022, BERT-family models were the default for any “encode this text into a vector” task.

Sentence embeddings: BERT isn’t enough alone

BERT’s [CLS] token isn’t actually a great sentence representation out of the box. The embeddings of similar sentences aren’t reliably close.

Sentence-BERT (Reimers & Gurevych, 2019) fixed this:

  • Take a pretrained BERT.
  • Fine-tune with a contrastive or triplet loss on sentence pairs.
  • Pool the token outputs into one sentence vector (mean pooling in the original paper; [CLS] pooling is also common).

This produced the sentence-transformers ecosystem — high-quality sentence embeddings via cosine similarity.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A cat sat on the mat", "There is a feline on the rug"])
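
As a follow-up, the two vectors above can be compared with the same library's cosine-similarity helper; paraphrases should land close together:

from sentence_transformers import util
print(util.cos_sim(embeddings[0], embeddings[1]))   # noticeably higher than for unrelated sentences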

Modern embedding models (2023–2026)

The landscape evolved:

General-purpose

  • OpenAI text-embedding-3-large/small — strong, multilingual.
  • Voyage AI — high-quality, configurable dimensions, domain variants.
  • Cohere embed-v4 — multilingual, multimodal.
  • nomic-embed-text-v2 — open-weights competitive option.
  • bge-large-en-v1.5, mxbai-embed-large — strong open models.

Multilingual

  • multilingual-e5-large-instruct — 100+ languages, instruction-tuned.

Code

  • voyage-code-3 — code-specialized.
  • CodeT5+, UniXcoder — research-grade.

Long-context

  • jina-embeddings-v3 — handles 8k+ token chunks.

Multimodal

  • CLIP, SigLIP, OpenCLIP — text + image.
  • EVA-CLIP — scaled CLIP variants.
  • CoCa, BLIP-2 — text + image with generation.

Decoder-based embedding models (2024+)

A surprising trend: LLM hidden states make great embeddings. Models like NV-Embed, GritLM, E5-Mistral, and Qwen3-Embedding take a pretrained decoder LLM, fine-tune it contrastively, and pool its final hidden states (often the last token's) into an embedding. These often beat encoder-only embedding models on retrieval benchmarks.
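
The pooling idea itself is simple; a minimal sketch using gpt2 purely as a stand-in decoder (real embedders contrastively fine-tune a much larger LLM first):

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

inputs = tokenizer("A cat sat on the mat", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, seq_len, 768)
embedding = hidden[0, -1]                            # last-token pooling, the common choice for decoder embedders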

Matryoshka embeddings (MRL)

A technique introduced in 2022 and widely adopted by 2024: train embeddings such that any prefix of the vector is itself a meaningful (lower-dimensional) embedding.

A 1024-dim Matryoshka embedding gives you a usable 64-dim embedding by truncating to the first 64 dims. This means:

  • Storage choice at retrieval time, not training time.
  • Use small dims for fast first-stage retrieval, full dims for re-scoring.
  • Embed once, reuse for many use cases.

Most modern embedding APIs support this (OpenAI, Voyage, etc.).
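
The truncation step itself is trivial; a minimal sketch in numpy, with a random vector standing in for a real Matryoshka-trained embedding (re-normalize after truncating so cosine similarity stays well-behaved):

import numpy as np

full = np.random.randn(1024)                  # stand-in for a Matryoshka-trained embedding
full /= np.linalg.norm(full)

small = full[:64]                             # keep only the first 64 dimensions
small = small / np.linalg.norm(small)         # re-normalize before computing cosine similarities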

How contextual embeddings are produced today

For modern embedding models, the recipe is roughly:

  1. Pretrain a transformer (encoder or decoder) on a large text corpus.
  2. Contrastive fine-tune on a corpus of (query, positive, negative) triples (often millions of them). Pull positives close in embedding space; push negatives apart (see the sketch after this list).
  3. Optionally distill to a smaller model.
  4. Optionally task-condition with prefixes (“query:”, “passage:”, “search:”) so the model knows which mode it’s in.
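
Step 2 is the heart of the recipe. A minimal sketch of the usual in-batch contrastive loss in PyTorch, where contrastive_loss and temperature are illustrative names and the encoder producing the embeddings is left out:

import torch
import torch.nn.functional as F

def contrastive_loss(query_emb, passage_emb, temperature=0.05):
    # Row i of passage_emb is the positive for row i of query_emb;
    # every other row in the batch acts as a negative.
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(passage_emb, dim=-1)
    logits = q @ p.T / temperature                       # (batch, batch) scaled cosine similarities
    labels = torch.arange(q.size(0), device=q.device)    # positives sit on the diagonal
    return F.cross_entropy(logits, labels)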

The retrieval-quality leaderboards (MTEB) move every couple of months.

Choosing an embedding model

Task                        Pick
English RAG, general        text-embedding-3-large or voyage-3 or bge-large-en-v1.5
Multilingual RAG            voyage-multilingual-2 or multilingual-e5-large-instruct
Code search                 voyage-code-3
Long context (10k+)         jina-embeddings-v3
Multimodal (text+image)     CLIP / SigLIP / Voyage multimodal
Latency-critical            smaller model (e.g. all-MiniLM-L6-v2, MRL truncation)

When in doubt: try 2–3 on your actual data with recall@10 (or a similar retrieval metric) as your yardstick. Don't trust generic leaderboards alone.
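
A minimal sketch of that evaluation, assuming sentence-transformers and numpy; queries, docs, and gold are hypothetical stand-ins for your own data:

import numpy as np
from sentence_transformers import SentenceTransformer

queries = ["how do I reset my password?"]                          # your real queries
docs = ["To reset your password, open Settings...", "Shipping takes 3-5 days."]
gold = [{0}]                                                       # indices of relevant docs per query

def recall_at_10(model_name):
    model = SentenceTransformer(model_name)
    q = model.encode(queries, normalize_embeddings=True)
    d = model.encode(docs, normalize_embeddings=True)
    hits = 0
    for qi, relevant in enumerate(gold):
        top10 = np.argsort(-(q[qi] @ d.T))[:10]                    # rank docs by cosine similarity
        hits += bool(relevant & set(top10.tolist()))
    return hits / len(queries)

print(recall_at_10("all-MiniLM-L6-v2"))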

See also