Embedding Models for Retrieval
The choice of embedding model often matters more than the choice of vector database, chunking strategy, or LLM. A great embedding model in a naive RAG pipeline beats a clever pipeline built around a mediocre embedder.
What makes a good retrieval embedder
Different embedding models optimize for different things:
- Semantic similarity: paraphrases close together. (Sentence-BERT family.)
- Asymmetric retrieval: query → passage matching. (Most modern retrieval models.)
- Question answering: queries phrased as questions, passages as text.
- Classification: outputs that cluster by label.
- Code search: code symbol similarity.
For RAG, you want asymmetric retrieval: the model is trained on (query, relevant passage) pairs. Often it has separate encoding modes (“query” vs “passage”).
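A minimal sketch of asymmetric encoding with the sentence-transformers library, assuming an E5-style model that expects literal "query: " / "passage: " prefixes (other families use different conventions, so check the model card):

```python
# Sketch of query vs passage encoding, assuming an E5-style prefix convention.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")  # example model choice

queries = ["query: how do I rotate an API key?"]
passages = ["passage: API keys can be rotated from the security settings page."]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# On unit-length vectors, cosine similarity is just a dot product.
scores = q_emb @ p_emb.T
print(scores)
```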
The two-tower paradigm
Most retrieval embedders have:
- A query encoder.
- A passage encoder.
Sometimes shared weights, sometimes separate. Trained contrastively: pull positive pairs close, push negatives apart.
Modern variants often add instruction-tuning: the model accepts a task description as part of its input.
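In practice the task description is just prepended to the query text. A short sketch, assuming an e5-instruct-style "Instruct: ... / Query: ..." template (the exact format varies by model):

```python
# Hypothetical task description; the exact template comes from the model card.
task = "Given a customer support question, retrieve passages that answer it"
query = "how do I rotate an API key?"

query_text = f"Instruct: {task}\nQuery: {query}"  # embed this on the query side
passage_text = "API keys can be rotated from the security settings page."  # passages stay plain
```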
The MTEB leaderboard
The Massive Text Embedding Benchmark (MTEB) covers 50+ tasks across retrieval, classification, clustering, and semantic similarity, and is updated weekly.
Don’t blindly pick the top model. MTEB suffers from known overfitting: some models score well because they were trained on benchmark-adjacent data. Always test on your own data.
Closed-source options (early 2026)
- OpenAI `text-embedding-3-large` — 3072 dims (or smaller via Matryoshka), strong general-purpose, multilingual.
- OpenAI `text-embedding-3-small` — cheaper, 1536 dims, surprisingly competitive.
- Voyage AI `voyage-3-large` — high-quality, configurable dims, domain variants (code, finance, law).
- Cohere `embed-v4` — multilingual, multimodal (text + image), good for production.
- Anthropic doesn’t publish embeddings; you need a separate provider.
- Google Gemini embeddings — competitive, multilingual.
Open-weights options
- `bge-large-en-v1.5` (BAAI) — strong English baseline.
- `bge-m3` — multilingual, multi-functional (dense, sparse, multi-vector).
- `mxbai-embed-large-v1` — strong open model.
- `gte-large-en-v1.5` — Alibaba, strong general performance.
- `nomic-embed-text-v2-moe` — MoE-based open model.
- `jina-embeddings-v3` — long-context, late-chunking-friendly.
- `stella_en_1.5B_v5` — competitive 1.5B-param model.
- NV-Embed-v2 — NVIDIA’s decoder-based embedder, competitive with closed models.
- Qwen3-Embedding family — newer Qwen-based embedders.
Specialized embedders
| Domain | Model |
|---|---|
| Code | voyage-code-3, nomic-embed-code, CodeRankEmbed |
| Multilingual | multilingual-e5-large-instruct, bge-m3, jina-embeddings-v3 |
| Long context | jina-embeddings-v3 (8k+), nomic-embed-text-v2 (long-doc) |
| Multimodal (text+image) | CLIP, SigLIP, voyage-multimodal-3, OpenCLIP |
| Legal | LegalBERT-derived models, custom fine-tuned |
| Biomedical | BioBERT-derived, MedEmbed |
| Finance | FinBERT-derived, finance-embeddings-investopedia |
For domain-specific apps, a domain-tuned embedder usually beats the latest general SOTA.
Decoder-based embedders
A 2024 trend: take a pretrained decoder LLM and fine-tune its last layers (or mean-pool its hidden states) to produce embeddings. This surprisingly often beats encoder-only embedders.
Examples: NV-Embed-v2, GritLM, E5-Mistral, Qwen3-Embedding.
Why it works: decoders are bigger (more capacity), trained on more diverse data, and instruction-following naturally extends to “embed this for retrieval.”
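To make the mechanics concrete, here is an illustrative sketch of pooling a decoder’s hidden states into a single vector. The model name is a stand-in, and real decoder embedders (NV-Embed, E5-Mistral, Qwen3-Embedding) define their own pooling and prompt formats, so follow the model card for those.

```python
# Illustrative only: masked mean pooling over a decoder LLM's hidden states.
import torch
from transformers import AutoModel, AutoTokenizer

name = "gpt2"  # stand-in decoder; any causal LM shows the same mechanics
tok = AutoTokenizer.from_pretrained(name)
tok.pad_token = tok.eos_token          # GPT-2 has no pad token by default
model = AutoModel.from_pretrained(name)

texts = ["rotate an API key", "API keys are rotated in security settings"]
batch = tok(texts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state      # (batch, seq, dim)

mask = batch["attention_mask"].unsqueeze(-1)       # ignore padding positions
emb = (hidden * mask).sum(1) / mask.sum(1)         # masked mean pool
emb = torch.nn.functional.normalize(emb, dim=-1)   # unit-length vectors
```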
Matryoshka embeddings (MRL)
Matryoshka models are trained so that any prefix of the vector is itself a valid (lower-dimensional) embedding.
A 1024-dim Matryoshka embedding lets you:
- Use 64 dims for fast first-stage filtering.
- Use 256 dims for re-scoring.
- Use full 1024 for final ranking.
Modern OpenAI, Voyage, and several open models support this. It lets you trade storage against quality at retrieval time rather than at training time.
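A sketch of how truncation works, assuming an MRL-trained model; with OpenAI’s text-embedding-3 models you can also request a smaller vector directly via the `dimensions` parameter.

```python
# Matryoshka truncation: keep the first k dims and re-normalize each row.
import numpy as np

def truncate(vecs: np.ndarray, dims: int) -> np.ndarray:
    """Take the leading `dims` components and L2-normalize."""
    v = vecs[:, :dims]
    return v / np.linalg.norm(v, axis=1, keepdims=True)

full = np.random.randn(1000, 1024).astype(np.float32)  # placeholder corpus embeddings
coarse = truncate(full, 64)    # fast first-stage filtering
mid = truncate(full, 256)      # re-scoring
# use the full 1024-dim vectors for final ranking
```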
Multi-vector retrieval (ColBERT-style)
Instead of one vector per chunk, produce many — one per token. Match queries token-by-token against passages token-by-token.
- More accurate (especially for long passages).
- More expensive (storage and compute).
- Used in ColBERT, JaColBERT, and as late-interaction re-rankers.
For most apps, a single dense vector + a cross-encoder re-ranker beats raw multi-vector retrieval on a cost basis.
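A toy illustration of late-interaction (MaxSim) scoring, not ColBERT’s actual implementation: each query token vector is matched against its best passage token vector, and those maxima are summed.

```python
# Toy MaxSim scoring over per-token vectors (both sides L2-normalized).
import numpy as np

def maxsim(query_tokens: np.ndarray, passage_tokens: np.ndarray) -> float:
    """query_tokens: (q, d), passage_tokens: (p, d)."""
    sims = query_tokens @ passage_tokens.T   # (q, p) cosine similarities
    return float(sims.max(axis=1).sum())     # best passage match per query token

q = np.random.randn(8, 128);  q /= np.linalg.norm(q, axis=1, keepdims=True)
p = np.random.randn(200, 128); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim(q, p))
```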
Sparse + dense (hybrid)
We cover this in hybrid-search-and-reranking.md. Briefly: combine BM25 (sparse keyword matching) with dense embeddings. Often beats either alone.
bge-m3 and Splade produce both dense and sparse output from the same model — convenient.
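As a preview, one common way to combine the two result lists is reciprocal rank fusion; a minimal sketch with hypothetical doc ids (see hybrid-search-and-reranking.md for the full treatment):

```python
# Reciprocal rank fusion over ranked id lists from sparse and dense retrieval.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Each ranking is a list of doc ids, best first."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["d3", "d1", "d7"]    # ids from the sparse retriever, best first
dense_top = ["d1", "d5", "d3"]   # ids from the dense retriever
print(rrf([bm25_top, dense_top]))
```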
Choosing for your use case
| Scenario | Recommended starting point |
|---|---|
| English RAG, general | text-embedding-3-large or voyage-3-large |
| Open-source preferred | bge-large-en-v1.5 or bge-m3 |
| Code search | voyage-code-3 |
| Multilingual (Asia, Europe) | bge-m3 or voyage-multilingual-2 |
| Long context per chunk (>512 tokens) | jina-embeddings-v3 or use chunked OpenAI |
| Latency-critical | text-embedding-3-small with MRL → 256 dims |
| Budget-constrained | text-embedding-3-small |
| Multimodal (text + image) | voyage-multimodal-3 or CLIP/SigLIP |
Evaluating an embedder for your data
- Build a small gold set: 50–200 (query, relevant_doc) pairs from your domain.
- For each query, retrieve top-k from your corpus.
- Measure recall@k — fraction of queries where the relevant doc is in top-k.
- Compare across embedders.
Tools: BEIR-style eval, mteb library, custom scripts.
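A custom script can be very short. A minimal recall@k harness, assuming you supply a `search(query, k)` callable (a hypothetical stand-in for your retrieval stack) and the gold pairs described above:

```python
# Minimal recall@k over a gold set of (query, relevant_doc_id) pairs.
from typing import Callable, Iterable

def recall_at_k(
    gold: Iterable[tuple[str, str]],
    search: Callable[[str, int], list[str]],
    k: int = 10,
) -> float:
    gold = list(gold)
    hits = sum(1 for query, doc_id in gold if doc_id in search(query, k))
    return hits / len(gold)

# recall_at_k(gold_pairs, search=my_retriever, k=10)  # repeat per embedder and compare
```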
Don’t trust generic leaderboards alone. Your data is unique.
Domain adaptation
If a generic embedder underperforms on your data:
- Fine-tune with contrastive loss on (query, positive) pairs from your domain. See Stage 10 — Embedding fine-tuning.
- Use synthetic pairs: have an LLM generate questions for each document, fine-tune.
- Use hard negatives: mine confusable but irrelevant docs to push apart.
Even a few thousand domain pairs can give a 5–15-point recall jump.
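A sketch of the contrastive route using sentence-transformers’ MultipleNegativesRankingLoss (in-batch negatives); the starting checkpoint and pairs below are placeholders, and synthetic pairs or mined hard negatives drop into the same setup. (The classic `.fit()` API is shown; newer releases also offer a Trainer-based flow.)

```python
# Contrastive fine-tuning on (query, positive passage) pairs.
from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example starting checkpoint

pairs = [
    InputExample(texts=["how do I rotate an API key?",
                        "API keys can be rotated from the security settings page."]),
    # ... a few thousand domain pairs
]
loader = DataLoader(pairs, batch_size=32, shuffle=True)
loss = losses.MultipleNegativesRankingLoss(model)  # other pairs in the batch act as negatives

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("my-domain-embedder")
```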
Operational concerns
- Vector dimension drives storage cost. 1024-dim vs 1536-dim matters at billion-scale.
- Embedding model upgrades = re-embed everything. Plan for it; version your indexes.
- Latency: embedding APIs add 50–200ms per query. Cache aggressively.
- Cost: at scale, embedding costs add up. Smaller dims, batch embedding, MRL truncation help.
- Quantization: int8 or float16 vector storage often saves 4×–8× with minimal quality loss.
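As an illustration of the last point, naive symmetric int8 quantization looks like the sketch below; in production, prefer the scalar or product quantization built into your vector database or faiss.

```python
# Naive per-vector symmetric int8 quantization (illustration only).
import numpy as np

def quantize_int8(vecs: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    scale = np.abs(vecs).max(axis=1, keepdims=True) / 127.0 + 1e-12  # per-vector scale
    q = np.round(vecs / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

vecs = np.random.randn(1000, 1024).astype(np.float32)
q, scale = quantize_int8(vecs)   # 4x smaller than float32 storage
approx = dequantize(q, scale)
```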
Pitfalls
- Mixing models: vectors from `text-embedding-3-small` and `bge-large` are not comparable. Don’t mix them.
- Forgetting to use the right “mode”: many models distinguish query vs passage encoding. Check the docs.
- Ignoring instructions: instruction-tuned embedders work better when given the task description (“Represent this query for retrieval…”).
- Stale embeddings: source docs updated, embeddings not refreshed. Build re-indexing into your pipeline.