Contextual Embeddings
A contextual embedding is a vector for a word in a particular context. The same word in different sentences gets different vectors. This is what modern embedding models produce.
The shift
- Static embeddings: one vector per word.
- Contextual embeddings: one vector per word occurrence, computed by a deep model that looks at the surrounding text.

- "He sat on the bank of the river" → embedding of "bank" leans toward {river, shore, water}
- "He deposited cash at the bank" → embedding of "bank" leans toward {money, finance, account}
This is the difference between a 2010s NLP model and a 2018+ NLP model.
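To make the contrast concrete, here is a minimal sketch, assuming Hugging Face transformers and the bert-base-uncased checkpoint, that extracts the vector for "bank" in each sentence and compares them:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    # Run the sentence through BERT and pick out the hidden state of the "bank" token.
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, 768)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return hidden[tokens.index("bank")]

v_river = bank_vector("He sat on the bank of the river")
v_money = bank_vector("He deposited cash at the bank")
# Same word, two different vectors: the cosine similarity is noticeably below 1.0.
print(torch.cosine_similarity(v_river, v_money, dim=0).item())
```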
ELMo (2018)
Embeddings from Language Models. The first widely-used contextual embeddings.
- Train a bidirectional LSTM language model.
- Represent each token by a learnable combination of the LSTM’s per-layer hidden states.
- Use these as features in downstream models.
ELMo demonstrated that contextual representations beat static ones across nearly every NLP benchmark — but it was already on its way out as BERT arrived.
BERT (2018)
Bidirectional Encoder Representations from Transformers. The big leap.
Architecture: a transformer encoder (bidirectional self-attention).
Training objective: masked language modeling.
- Take a sentence.
- Randomly mask 15% of tokens.
- Train the model to predict the masked tokens given the rest.
"The quick brown [MASK] jumps over the lazy dog"
↑ predict "fox"
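As a quick illustration (not from the paper itself), the Hugging Face fill-mask pipeline runs exactly this prediction with a pretrained BERT:

```python
from transformers import pipeline

# Ask a pretrained BERT to fill in the masked token.
fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The quick brown [MASK] jumps over the lazy dog.", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
# "fox" should show up among the top predictions.
```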
A second auxiliary objective in the original paper was next sentence prediction (later shown to be useless and dropped in RoBERTa).
What BERT gives you
For each token, a 768-dim (base) or 1024-dim (large) vector that depends on the entire sentence. These can be:
- Used directly as features.
- Fine-tuned on a downstream task (classification, QA, NER) with a small head added.
For classification, the convention is to use the [CLS] token’s embedding as a sentence representation.
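A minimal fine-tuning sketch, assuming Hugging Face transformers, where AutoModelForSequenceClassification adds a small classification head on top of the pooled [CLS] representation:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# One toy training step: both the head and the whole encoder get gradients.
batch = tokenizer(["great movie", "terrible movie"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])
loss = model(**batch, labels=labels).loss    # cross-entropy over the 2-way head
loss.backward()
```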
Variants
- RoBERTa: BERT with better training (no NSP, more data, longer training, dynamic masking instead of a fixed mask pattern).
- DistilBERT: knowledge-distilled smaller version.
- ALBERT: shares parameters across layers; far fewer parameters, but every layer still runs, so it isn't proportionally faster (the largest variants are slower per step).
- DeBERTa: improved with disentangled attention; SOTA on GLUE for years.
- ELECTRA: trained to discriminate real vs. replaced tokens — more compute-efficient.
For 2018–2022, BERT-family models were the default for any “encode this text into a vector” task.
Sentence embeddings: BERT isn’t enough alone
BERT’s [CLS] token isn’t actually a great sentence representation out of the box. The embeddings of similar sentences aren’t reliably close.
Sentence-BERT (Reimers & Gurevych, 2019) fixed this:
- Take a pretrained BERT.
- Fine-tune with a contrastive or triplet loss on sentence pairs (a minimal sketch follows this list).
- Output a pooled sentence embedding (mean pooling over token vectors by default, or the [CLS] vector).
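A minimal sketch of that fine-tuning step with the sentence-transformers library; the base model, the two example pairs, and the hyperparameters are placeholders, not the original training setup:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Plain BERT; sentence-transformers adds a mean-pooling layer on top automatically.
model = SentenceTransformer("bert-base-uncased")

pairs = [
    InputExample(texts=["A cat sat on the mat", "There is a feline on the rug"]),
    InputExample(texts=["He deposited cash at the bank", "She made a deposit at her bank"]),
]
loader = DataLoader(pairs, shuffle=True, batch_size=2)

# In-batch contrastive loss: each pair's second sentence is the positive,
# every other sentence in the batch acts as a negative.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```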
This produced the sentence-transformers ecosystem — high-quality sentence embeddings via cosine similarity.
```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["A cat sat on the mat", "There is a feline on the rug"])
```
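To use these the way the ecosystem intends, score pairs with cosine similarity (util.cos_sim ships with sentence-transformers):

```python
from sentence_transformers import util

# The paraphrase pair above should score high; unrelated sentences score much lower.
print(util.cos_sim(embeddings[0], embeddings[1]))
```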
Modern embedding models (2023–2026)
The landscape evolved:
General-purpose
- OpenAI text-embedding-3-large / text-embedding-3-small — strong, multilingual.
- Voyage AI — high-quality, configurable dimensions, domain variants.
- Cohere embed-v4 — multilingual, multimodal.
- nomic-embed-text-v2 — open-weights competitive option.
- bge-large-en-v1.5, mxbai-embed-large — strong open models.
Multilingual
- multilingual-e5-large-instruct — 100+ languages, instruction-tuned.
Code
- voyage-code-3 — code-specialized.
- CodeT5+, UniXcoder — research-grade.
Long-context
- jina-embeddings-v3 — handles 8k+ token chunks.
Multimodal
- CLIP, SigLIP, OpenCLIP — text + image.
- EVA-CLIP — scaled CLIP variants.
- CoCa, BLIP-2 — text + image with generation.
Decoder-based embedding models (2024+)
A surprising trend: LLM hidden states make great embeddings. Models like NV-Embed, GritLM, E5-Mistral, and Qwen3-Embedding take a pretrained decoder LLM and fine-tune it so that a pooled last-layer hidden state (often the final token's) serves as the embedding. These often beat encoder-only embedding models.
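The core mechanic is a pooled hidden state from the decoder's last layer. A hedged sketch; the model name here is a placeholder, and real recipes add instruction prefixes plus contrastive fine-tuning on top:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("some-decoder-llm")   # placeholder checkpoint name
llm = AutoModel.from_pretrained("some-decoder-llm")

inputs = tok("query: how do contextual embeddings work?", return_tensors="pt")
with torch.no_grad():
    hidden = llm(**inputs).last_hidden_state               # (1, seq_len, d_model)

embedding = hidden[0, -1]                                   # last-token pooling
embedding = embedding / embedding.norm()                    # unit-normalize for cosine similarity
```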
Matryoshka embeddings (MRL)
A 2024 trick: train embeddings such that any prefix of the vector is itself a meaningful (lower-dimensional) embedding.
A 1024-dim Matryoshka embedding gives you a usable 64-dim embedding by truncating to the first 64 dims. This means:
- Storage choice at retrieval time, not training time.
- Use small dims for fast first-stage retrieval, full dims for re-scoring.
- Embed once, reuse for many use cases.
Most modern embedding APIs support this (OpenAI, Voyage, etc.).
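A sketch of what that buys you at query time; `full` here is a stand-in for a real 1024-dim embedding from an MRL-trained model:

```python
import numpy as np

full = np.random.randn(1024).astype(np.float32)   # stand-in for a real MRL embedding
small = full[:64]                                 # the first 64 dims are themselves a valid embedding
small = small / np.linalg.norm(small)             # re-normalize after truncation

# `small` can index a fast first-stage ANN store; `full` can re-score the top hits.
```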
How contextual embeddings are produced today
For modern embedding models, the recipe is roughly:
- Pretrain a transformer (encoder or decoder) on a large text corpus.
- Contrastive fine-tune on a corpus of (query, positive, negative) triples — millions of them. Pull positives close in embedding space; push negatives apart (a loss sketch follows this list).
- Optionally distill to a smaller model.
- Optionally task-condition with prefixes (“query:”, “passage:”, “search:”) so the model knows which mode it’s in.
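Here is a hedged sketch of the contrastive step: an in-batch InfoNCE loss over (query, positive, negative) triples, with the task prefixes mentioned above; `encode` stands in for whatever model is being trained:

```python
import torch
import torch.nn.functional as F

def info_nce(encode, queries, positives, negatives, temperature=0.05):
    # encode(...) is assumed to return a (batch, dim) tensor with gradients attached.
    q = F.normalize(encode(["query: " + t for t in queries]), dim=-1)
    p = F.normalize(encode(["passage: " + t for t in positives]), dim=-1)
    n = F.normalize(encode(["passage: " + t for t in negatives]), dim=-1)

    candidates = torch.cat([p, n], dim=0)          # positives plus hard negatives
    logits = q @ candidates.T / temperature        # each query scored against every candidate
    targets = torch.arange(len(queries))           # the i-th candidate is query i's positive
    return F.cross_entropy(logits, targets)        # pull positives close, push negatives apart
```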
The retrieval-quality leaderboards (MTEB) move every couple of months.
Choosing an embedding model
| Task | Pick |
|---|---|
| English RAG, general | text-embedding-3-large or voyage-3 or bge-large-en-v1.5 |
| Multilingual RAG | voyage-multilingual-2 or multilingual-e5-large-instruct |
| Code search | voyage-code-3 |
| Long context (10k+) | jina-embeddings-v3 |
| Multimodal (text+image) | CLIP / SigLIP / Voyage multimodal |
| Latency-critical | smaller model (e.g. all-MiniLM-L6-v2, MRL truncation) |
When in doubt: try 2–3 on your actual data with retrieval@10 as your metric. Don’t trust generic leaderboards alone.
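A minimal sketch of that kind of check, measuring retrieval@10 as recall of labeled relevant docs within the top 10; `embed`, `queries`, `docs`, and `relevant` are placeholders for your own model wrapper and evaluation set:

```python
import numpy as np

def recall_at_10(embed, queries, docs, relevant):
    # relevant[i] is the set of doc indices judged relevant for query i.
    q = embed(queries)                             # (num_queries, dim), assumed unit-normalized
    d = embed(docs)                                # (num_docs, dim)
    scores = q @ d.T                               # cosine similarity for unit vectors
    top10 = np.argsort(-scores, axis=1)[:, :10]
    hits = [len(set(top10[i]) & relevant[i]) / max(1, len(relevant[i]))
            for i in range(len(queries))]
    return float(np.mean(hits))
```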