Embedding Models for Retrieval
The choice of embedding model often matters more than the choice of vector database, chunking strategy, or LLM. A great embedding model in a naive RAG pipeline beats a clever pipeline built around a mediocre embedder.
What makes a good retrieval embedder
Different embedding models optimize for different things:
- Semantic similarity: paraphrases close together. (Sentence-BERT family.)
- Asymmetric retrieval: query → passage matching. (Most modern retrieval models.)
- Question answering: queries phrased as questions, passages as text.
- Classification: outputs that cluster by label.
- Code search: code symbol similarity.
For RAG, you want asymmetric retrieval: the model is trained on (query, relevant passage) pairs. Often it has separate encoding modes (“query” vs “passage”).
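A minimal sketch of asymmetric encoding with the sentence-transformers library, assuming an E5-style model that expects literal "query: " / "passage: " prefixes (other families use different conventions, so check the model card):

```python
# Sketch of query vs passage encoding, assuming an E5-style prefix convention.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # example model choice

queries = ["query: how do I rotate an API key?"]
passages = ["passage: API keys can be rotated from the security settings page."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# On unit-length vectors, cosine similarity is just a dot product.
scores = q_emb @ p_emb.T
print(scores)
```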
The two-tower paradigm
Most retrieval embedders have:
- A query encoder.
- A passage encoder.
Sometimes shared weights, sometimes separate. Trained contrastively: pull positive pairs close, push negatives apart.
Modern variants often add instruction-tuning: the model accepts a task description as part of its input.
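In practice the task description is just prepended to the query text. A short sketch, assuming an e5-instruct-style "Instruct: ... / Query: ..." template (the exact format varies by model):

```python
# Hypothetical task description; the exact template comes from the model card.
task = "Given a customer support question, retrieve passages that answer it"
query = "how do I rotate an API key?"

query_text = f"Instruct: {task}\nQuery: {query}"  # embed this on the query side
passage_text = "API keys can be rotated from the security settings page."  # passages stay plain
```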
The MTEB leaderboard
The Massive Text Embedding Benchmark (MTEB) covers 50+ tasks across retrieval, classification, clustering, and semantic similarity, and is updated weekly.
Don’t blindly pick the top model. MTEB suffers from known overfitting: some models score well because they were trained on benchmark-adjacent data. Always test on your own data.
Closed-source options (early 2026)
- OpenAI `text-embedding-3-large` — 3072 dims (or smaller via Matryoshka), strong general-purpose, multilingual.
- OpenAI `text-embedding-3-small` — cheaper, 1536 dims, surprisingly competitive.
- Voyage AI `voyage-3-large` — high-quality, configurable dims, domain variants (code, finance, law).
- Cohere `embed-v4` — multilingual, multimodal (text + image), good for production.
- Anthropic doesn’t publish embeddings; you need a separate provider.
- Google Gemini embeddings — competitive, multilingual.
Open-weights options
- `bge-large-en-v1.5` (BAAI) — strong English baseline.
- `bge-m3` — multilingual, multi-functional (dense, sparse, multi-vector).
- `mxbai-embed-large-v1` — strong open model.
- `gte-large-en-v1.5` — Alibaba, strong general performance.
- `nomic-embed-text-v2-moe` — MoE-based open model.
- `jina-embeddings-v3` — long-context, late-chunking-friendly.
- `stella_en_1.5B_v5` — competitive 1.5B-param model.
- NV-Embed-v2 — NVIDIA’s decoder-based embedder, competitive with closed models.
- Qwen3-Embedding family — newer Qwen-based embedders.
Specialized embedders
| Domain | Model |
|---|---|
| Code | voyage-code-3, nomic-embed-code, CodeRankEmbed |
| Multilingual | multilingual-e5-large-instruct, bge-m3, jina-embeddings-v3 |
| Long context | jina-embeddings-v3 (8k+), nomic-embed-text-v2 (long-doc) |
| Multimodal (text+image) | CLIP, SigLIP, voyage-multimodal-3, OpenCLIP |
| Legal | LegalBERT-derived models, custom fine-tuned |
| Biomedical | BioBERT-derived, MedEmbed |
| Finance | FinBERT-derived, finance-embeddings-investopedia |
For domain-specific apps, a domain-tuned embedder usually beats the latest general SOTA.
Decoder-based embedders
A 2024 trend: take a pretrained decoder LLM and fine-tune its last layers (or mean-pool its hidden states) to produce embeddings. This surprisingly often beats encoder-only embedders.
Examples: NV-Embed-v2, GritLM, E5-Mistral, Qwen3-Embedding.
Why it works: decoders are bigger (more capacity), trained on more diverse data, and instruction-following naturally extends to “embed this for retrieval.”
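To make the mechanics concrete, here is an illustrative sketch of pooling a decoder’s hidden states into a single vector. The model name is a stand-in, and real decoder embedders (NV-Embed, E5-Mistral, Qwen3-Embedding) define their own pooling and prompt formats, so follow the model card for those.

```python
# Illustrative only: masked mean pooling over a decoder LLM's hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # stand-in decoder; any causal LM shows the same mechanics
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token          # GPT-2 has no pad token by default
model = AutoModel.from_pretrained(name)

texts = ["rotate an API key", "API keys are rotated in security settings"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (batch, seq, dim)

mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding positions
emb = (hidden * mask).sum(1) / mask.sum(1)         # masked mean pool
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-length vectors
```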
Matryoshka embeddings (MRL)
Matryoshka models are trained so that any prefix of the vector is itself a valid (lower-dimensional) embedding.
A 1024-dim Matryoshka embedding lets you:
- Use 64 dims for fast first-stage filtering.
- Use 256 dims for re-scoring.
- Use full 1024 for final ranking.
Modern OpenAI, Voyage, and several open models support this. It lets you trade storage against quality at retrieval time rather than at training time.
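A sketch of how truncation works, assuming an MRL-trained model; with OpenAI’s text-embedding-3 models you can also request a smaller vector directly via the `dimensions` parameter.

```python
# Matryoshka truncation: keep the first k dims and re-normalize each row.
import numpy as np

def truncate(vecs: np.ndarray, dims: int) -> np.ndarray:
    """Take the leading `dims` components and L2-normalize."""
    v = vecs[:, :dims]
    return v / np.linalg.norm(v, axis=1, keepdims=True)

full = np.random.randn(1000, 1024).astype(np.float32)  # placeholder corpus embeddings
coarse = truncate(full, 64)    # fast first-stage filtering
mid = truncate(full, 256)      # re-scoring
# use the full 1024-dim vectors for final ranking
```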
Multi-vector retrieval (ColBERT-style)
Instead of one vector per chunk, produce many — one per token. Match queries token-by-token against passages token-by-token.
- More accurate (especially for long passages).
- More expensive (storage and compute).
- Used in ColBERT, JaColBERT, and as late-interaction re-rankers.
For most apps, a single dense vector + a cross-encoder re-ranker beats raw multi-vector retrieval on a cost basis.
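A toy illustration of late-interaction (MaxSim) scoring, not ColBERT’s actual implementation: each query token vector is matched against its best passage token vector, and those maxima are summed.

```python
# Toy MaxSim scoring over per-token vectors (both sides L2-normalized).
import numpy as np

def maxsim(query_tokens: np.ndarray, passage_tokens: np.ndarray) -> float:
    """query_tokens: (q, d), passage_tokens: (p, d)."""
    sims = query_tokens @ passage_tokens.T   # (q, p) cosine similarities
    return float(sims.max(axis=1).sum())     # best passage match per query token

q = np.random.randn(8, 128);  q /= np.linalg.norm(q, axis=1, keepdims=True)
p = np.random.randn(200, 128); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim(q, p))
```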
Sparse + dense (hybrid)
We cover this in hybrid-search-and-reranking.md. Briefly: combine BM25 (sparse keyword matching) with dense embeddings. Often beats either alone.
bge-m3 and Splade produce both dense and sparse output from the same model — convenient.
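As a preview, one common way to combine the two result lists is reciprocal rank fusion; a minimal sketch with hypothetical doc ids (see hybrid-search-and-reranking.md for the full treatment):

```python
# Reciprocal rank fusion over ranked id lists from sparse and dense retrieval.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is a list of doc ids, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # ids from the sparse retriever, best first
dense_top = ["d1", "d5", "d3"]   # ids from the dense retriever
print(rrf([bm25_top, dense_top]))
```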
Choosing for your use case
| Scenario | Recommended starting point |
|---|---|
| English RAG, general | text-embedding-3-large or voyage-3-large |
| Open-source preferred | bge-large-en-v1.5 or bge-m3 |
| Code search | voyage-code-3 |
| Multilingual (Asia, Europe) | bge-m3 or voyage-multilingual-2 |
| Long context per chunk (>512 tokens) | jina-embeddings-v3 or use chunked OpenAI |
| Latency-critical | text-embedding-3-small with MRL → 256 dims |
| Budget-constrained | text-embedding-3-small |
| Multimodal (text + image) | voyage-multimodal-3 or CLIP/SigLIP |
Evaluating an embedder for your data
- Build a small gold set: 50–200 (query, relevant_doc) pairs from your domain.
- For each query, retrieve top-k from your corpus.
- Measure recall@k — fraction of queries where the relevant doc is in top-k.
- Compare across embedders.
Tools: BEIR-style eval, mteb library, custom scripts.
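A custom script can be very short. A minimal recall@k harness, assuming you supply a `search(query, k)` callable (a hypothetical stand-in for your retrieval stack) and the gold pairs described above:

```python
# Minimal recall@k over a gold set of (query, relevant_doc_id) pairs.
from typing import Callable, Iterable

def recall_at_k(
    gold: Iterable[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 10,
) -> float:
    gold = list(gold)
    hits = sum(1 for query, doc_id in gold if doc_id in search(query, k))
    return hits / len(gold)

# recall_at_k(gold_pairs, search=my_retriever, k=10)  # repeat per embedder and compare
```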
Don’t trust generic leaderboards alone. Your data is unique.
Domain adaptation
If a generic embedder underperforms on your data:
- Fine-tune with contrastive loss on (query, positive) pairs from your domain. See Stage 10 — Embedding fine-tuning.
- Use synthetic pairs: have an LLM generate questions for each document, fine-tune.
- Use hard negatives: mine confusable but irrelevant docs to push apart.
Even a few thousand domain pairs can give a 5–15-point recall jump.
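A sketch of the contrastive route using sentence-transformers’ MultipleNegativesRankingLoss (in-batch negatives); the starting checkpoint and pairs below are placeholders, and synthetic pairs or mined hard negatives drop into the same setup. (The classic `.fit()` API is shown; newer releases also offer a Trainer-based flow.)

```python
# Contrastive fine-tuning on (query, positive passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example starting checkpoint

pairs = [
    InputExample(texts=["how do I rotate an API key?",
                        "API keys can be rotated from the security settings page."]),
    # ... a few thousand domain pairs
]
loader = DataLoader(pairs, batch_size=32, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("my-domain-embedder")
```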
Operational concerns
- Vector dimension drives storage cost. 1024-dim vs 1536-dim matters at billion-scale.
- Embedding model upgrades = re-embed everything. Plan for it; version your indexes.
- Latency: embedding APIs add 50–200ms per query. Cache aggressively.
- Cost: at scale, embedding costs add up. Smaller dims, batch embedding, MRL truncation help.
- Quantization: int8 or float16 vector storage often saves 4×–8× with minimal quality loss.
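As an illustration of the last point, naive symmetric int8 quantization looks like the sketch below; in production, prefer the scalar or product quantization built into your vector database or faiss.

```python
# Naive per-vector symmetric int8 quantization (illustration only).
import numpy as np

def quantize_int8(vecs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    scale = np.abs(vecs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per-vector scale
    q = np.round(vecs / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(1000, 1024).astype(np.float32)
q, scale = quantize_int8(vecs)   # 4x smaller than float32 storage
approx = dequantize(q, scale)
```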
Pitfalls
- Mixing models: vectors from `text-embedding-3-small` and `bge-large` are not comparable. Don’t mix them.
- Forgetting to use the right “mode”: many models distinguish query vs passage encoding. Check the docs.
- Ignoring instructions: instruction-tuned embedders work better when given the task description (“Represent this query for retrieval…”).
- Stale embeddings: source docs updated, embeddings not refreshed. Build re-indexing into your pipeline.