Contextual Embeddings
A contextual embedding is a vector for a word in a particular context. The same word in different sentences gets different vectors. This is what modern embedding models produce.
The shift
- Static embeddings: one vector per word.
- Contextual embeddings: one vector per word occurrence, computed by a deep model that looks at the surrounding text.

- "He sat on the bank of the river" → embedding of "bank" leans toward {river, shore, water}
- "He deposited cash at the bank" → embedding of "bank" leans toward {money, finance, account}
This is the difference between a 2010s NLP model and a 2018+ NLP model.
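To make the contrast concrete, here is a minimal sketch, assuming Hugging Face transformers and the bert-base-uncased checkpoint, that extracts the vector for "bank" in each sentence and compares them:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Run the sentence through BERT and pick out the hidden state of the "bank" token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat on the bank of the river")
v_money = bank_vector("He deposited cash at the bank")
# Same word, two different vectors: the cosine similarity is noticeably below 1.0.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```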
ELMo (2018)
Embeddings from Language Models. The first widely-used contextual embeddings.
- Train a bidirectional LSTM language model.
- Represent each token by a learnable combination of the LSTM’s per-layer hidden states.
- Use these as features in downstream models.
ELMo demonstrated that contextual representations beat static ones across nearly every NLP benchmark — but it was already on its way out as BERT arrived.
BERT (2018)
Bidirectional Encoder Representations from Transformers. The big leap.
Architecture: a transformer encoder (bidirectional self-attention).
Training objective: masked language modeling.
- Take a sentence.
- Randomly mask 15% of tokens.
- Train the model to predict the masked tokens given the rest.
"The quick brown [MASK] jumps over the lazy dog"
↑ predict "fox"
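As a quick illustration (not from the paper itself), the Hugging Face fill-mask pipeline runs exactly this prediction with a pretrained BERT:

```python
from transformers import pipeline

# Ask a pretrained BERT to fill in the masked token.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The quick brown [MASK] jumps over the lazy dog.", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
# "fox" should show up among the top predictions.
```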
A second auxiliary objective in the original paper was next sentence prediction (later shown to be useless and dropped in RoBERTa).
What BERT gives you
For each token, a 768-dim (base) or 1024-dim (large) vector that depends on the entire sentence. These can be:
- Used directly as features.
- Fine-tuned on a downstream task (classification, QA, NER) with a small head added.
For classification, the convention is to use the [CLS] token’s embedding as a sentence representation.
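A minimal fine-tuning sketch, assuming Hugging Face transformers, where AutoModelForSequenceClassification adds a small classification head on top of the pooled [CLS] representation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One toy training step: both the head and the whole encoder get gradients.
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss    # cross-entropy over the 2-way head
loss.backward()
```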
Variants
- RoBERTa: BERT with better training (no NSP, more data, longer training, dynamic masking instead of a fixed mask pattern).
- DistilBERT: knowledge-distilled smaller version.
- ALBERT: shares parameters across layers; far fewer parameters, but every layer still runs, so it isn't proportionally faster (the largest variants are slower per step).
- DeBERTa: improved with disentangled attention; SOTA on GLUE for years.
- ELECTRA: trained to discriminate real vs. replaced tokens — more compute-efficient.
For 2018–2022, BERT-family models were the default for any “encode this text into a vector” task.
Sentence embeddings: BERT isn’t enough alone
BERT’s [CLS] token isn’t actually a great sentence representation out of the box. The embeddings of similar sentences aren’t reliably close.
Sentence-BERT (Reimers & Gurevych, 2019) fixed this:
- Take a pretrained BERT.
- Fine-tune with a contrastive or triplet loss on sentence pairs (a minimal sketch follows this list).
- Output a pooled sentence embedding (mean pooling over token vectors by default, or the [CLS] vector).
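A minimal sketch of that fine-tuning step with the sentence-transformers library; the base model, the two example pairs, and the hyperparameters are placeholders, not the original training setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Plain BERT; sentence-transformers adds a mean-pooling layer on top automatically.
model = SentenceTransformer("bert-base-uncased")

pairs = [
    InputExample(texts=["A cat sat on the mat", "There is a feline on the rug"]),
    InputExample(texts=["He deposited cash at the bank", "She made a deposit at her bank"]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)

# In-batch contrastive loss: each pair's second sentence is the positive,
# every other sentence in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```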
This produced the sentence-transformers ecosystem — high-quality sentence embeddings via cosine similarity.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A cat sat on the mat", "There is a feline on the rug"])
```
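To use these the way the ecosystem intends, score pairs with cosine similarity (util.cos_sim ships with sentence-transformers):

```python
from sentence_transformers import util

# The paraphrase pair above should score high; unrelated sentences score much lower.
print(util.cos_sim(embeddings[0], embeddings[1]))
```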
Modern embedding models (2023–2026)
The landscape evolved:
General-purpose
- OpenAI text-embedding-3-large / text-embedding-3-small — strong, multilingual.
- Voyage AI — high-quality, configurable dimensions, domain variants.
- Cohere embed-v4 — multilingual, multimodal.
- nomic-embed-text-v2 — open-weights competitive option.
- bge-large-en-v1.5, mxbai-embed-large — strong open models.
Multilingual
- multilingual-e5-large-instruct — 100+ languages, instruction-tuned.
Code
- voyage-code-3 — code-specialized.
- CodeT5+, UniXcoder — research-grade.
Long-context
- jina-embeddings-v3 — handles 8k+ token chunks.
Multimodal
- CLIP, SigLIP, OpenCLIP — text + image.
- EVA-CLIP — scaled CLIP variants.
- CoCa, BLIP-2 — text + image with generation.
Decoder-based embedding models (2024+)
A surprising trend: LLM hidden states make great embeddings. Models like NV-Embed, GritLM, E5-Mistral, and Qwen3-Embedding take a pretrained decoder LLM and fine-tune it so that a pooled last-layer hidden state (often the final token's) serves as the embedding. These often beat encoder-only embedding models.
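The core mechanic is a pooled hidden state from the decoder's last layer. A hedged sketch; the model name here is a placeholder, and real recipes add instruction prefixes plus contrastive fine-tuning on top:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("some-decoder-llm")   # placeholder checkpoint name
llm = AutoModel.from_pretrained("some-decoder-llm")

inputs = tok("query: how do contextual embeddings work?", return_tensors="pt")
with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state               # (1, seq_len, d_model)

embedding = hidden[0, -1]                                   # last-token pooling
embedding = embedding / embedding.norm()                    # unit-normalize for cosine similarity
```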
Matryoshka embeddings (MRL)
A 2024 trick: train embeddings such that any prefix of the vector is itself a meaningful (lower-dimensional) embedding.
A 1024-dim Matryoshka embedding gives you a usable 64-dim embedding by truncating to the first 64 dims. This means:
- Storage choice at retrieval time, not training time.
- Use small dims for fast first-stage retrieval, full dims for re-scoring.
- Embed once, reuse for many use cases.
Most modern embedding APIs support this (OpenAI, Voyage, etc.).
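A sketch of what that buys you at query time; `full` here is a stand-in for a real 1024-dim embedding from an MRL-trained model:

```python
import numpy as np

full = np.random.randn(1024).astype(np.float32)   # stand-in for a real MRL embedding
small = full[:64]                                 # the first 64 dims are themselves a valid embedding
small = small / np.linalg.norm(small)             # re-normalize after truncation

# `small` can index a fast first-stage ANN store; `full` can re-score the top hits.
```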
How contextual embeddings are produced today
For modern embedding models, the recipe is roughly:
- Pretrain a transformer (encoder or decoder) on a large text corpus.
- Contrastive fine-tune on a corpus of (query, positive, negative) triples — millions of them. Pull positives close in embedding space; push negatives apart (a loss sketch follows this list).
- Optionally distill to a smaller model.
- Optionally task-condition with prefixes (“query:”, “passage:”, “search:”) so the model knows which mode it’s in.
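Here is a hedged sketch of the contrastive step: an in-batch InfoNCE loss over (query, positive, negative) triples, with the task prefixes mentioned above; `encode` stands in for whatever model is being trained:

```python
import torch
import torch.nn.functional as F

def info_nce(encode, queries, positives, negatives, temperature=0.05):
    # encode(...) is assumed to return a (batch, dim) tensor with gradients attached.
    q = F.normalize(encode(["query: " + t for t in queries]), dim=-1)
    p = F.normalize(encode(["passage: " + t for t in positives]), dim=-1)
    n = F.normalize(encode(["passage: " + t for t in negatives]), dim=-1)

    candidates = torch.cat([p, n], dim=0)          # positives plus hard negatives
    logits = q @ candidates.T / temperature        # each query scored against every candidate
    targets = torch.arange(len(queries))           # the i-th candidate is query i's positive
    return F.cross_entropy(logits, targets)        # pull positives close, push negatives apart
```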
The retrieval-quality leaderboards (MTEB) move every couple of months.
Choosing an embedding model
| Task | Pick |
|---|---|
| English RAG, general | text-embedding-3-large or voyage-3 or bge-large-en-v1.5 |
| Multilingual RAG | voyage-multilingual-2 or multilingual-e5-large-instruct |
| Code search | voyage-code-3 |
| Long context (10k+) | jina-embeddings-v3 |
| Multimodal (text+image) | CLIP / SigLIP / Voyage multimodal |
| Latency-critical | smaller model (e.g. all-MiniLM-L6-v2, MRL truncation) |
When in doubt: try 2–3 on your actual data with retrieval@10 as your metric. Don’t trust generic leaderboards alone.
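A minimal sketch of that kind of check, measuring retrieval@10 as recall of labeled relevant docs within the top 10; `embed`, `queries`, `docs`, and `relevant` are placeholders for your own model wrapper and evaluation set:

```python
import numpy as np

def recall_at_10(embed, queries, docs, relevant):
    # relevant[i] is the set of doc indices judged relevant for query i.
    q = embed(queries)                             # (num_queries, dim), assumed unit-normalized
    d = embed(docs)                                # (num_docs, dim)
    scores = q @ d.T                               # cosine similarity for unit vectors
    top10 = np.argsort(-scores, axis=1)[:, :10]
    hits = [len(set(top10[i]) & relevant[i]) / max(1, len(relevant[i]))
            for i in range(len(queries))]
    return float(np.mean(hits))
```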