Embedding Models for Retrieval

The choice of embedding model often matters more than the choice of vector database, chunking strategy, or LLM. A great embedding model in a naive RAG pipeline usually beats a clever pipeline built around a mediocre embedder.

What makes a good retrieval embedder

Different embedding models optimize for different things:

  • Semantic similarity: paraphrases close together. (Sentence-BERT family.)
  • Asymmetric retrieval: query → passage matching. (Most modern retrieval models.)
  • Question answering: queries phrased as questions, passages as text.
  • Classification: outputs that cluster by label.
  • Code search: natural-language query → code matching, plus code-to-code similarity.

For RAG, you want asymmetric retrieval: the model is trained on (query, relevant passage) pairs. Often it has separate encoding modes (“query” vs “passage”).
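
As a concrete illustration, here is a minimal sketch of asymmetric encoding with an E5-style model via sentence-transformers. The "query: " / "passage: " prefixes are what the e5 family expects; other models use different prefixes or a prompt argument, so check the model card.

```python
from sentence_transformers import SentenceTransformer

# "query: " / "passage: " prefixes are specific to the e5 family; other models
# (bge, gte, ...) use different prefixes or a prompt argument -- check the model card.
model = SentenceTransformer("intfloat/e5-large-v2")

queries = ["query: how do I rotate an API key?"]
passages = [
    "passage: To rotate an API key, open the dashboard, revoke the old key, and issue a new one.",
    "passage: Our office is closed on public holidays.",
]

q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, cosine similarity is just a dot product.
print(q_emb @ p_emb.T)  # the relevant passage should score clearly higher
```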

The two-tower paradigm

Most retrieval embedders have:

  • A query encoder.
  • A passage encoder.

Sometimes shared weights, sometimes separate. Trained contrastively: pull positive pairs close, push negatives apart.

Modern variants often add instruction-tuning: the model accepts a task description as part of its input.
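
A schematic of that contrastive objective with in-batch negatives (InfoNCE), written in PyTorch and assuming a shared encoder; real training adds hard negatives, instruction prefixes, and much larger batches.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(q_emb: torch.Tensor, p_emb: torch.Tensor, temperature: float = 0.05) -> torch.Tensor:
    """q_emb, p_emb: (batch, dim); row i of p_emb is the positive passage for row i of q_emb."""
    q_emb = F.normalize(q_emb, dim=-1)
    p_emb = F.normalize(p_emb, dim=-1)
    logits = q_emb @ p_emb.T / temperature                      # (batch, batch) similarity matrix
    labels = torch.arange(q_emb.size(0), device=q_emb.device)   # diagonal entries are the positives
    # Cross-entropy pulls each (query, positive) pair together and pushes the
    # other passages in the batch (the in-batch negatives) apart.
    return F.cross_entropy(logits, labels)
```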

The MTEB leaderboard

The Massive Text Embedding Benchmark — covers 50+ tasks across retrieval, classification, clustering, semantic similarity. Updated weekly.

Don’t blindly pick the top model. MTEB has known overfitting; some models score well by being trained on benchmark-adjacent data. Always test on your data.

Closed-source options (early 2026)

  • OpenAI text-embedding-3-large — 3072 dims (or smaller via Matryoshka), strong general-purpose, multilingual.
  • OpenAI text-embedding-3-small — cheaper, 1536 dims, surprisingly competitive.
  • Voyage AI voyage-3-large — high-quality, configurable dims, domain variants (code, finance, law).
  • Cohere embed-v4 — multilingual, multimodal (text + image), good for production.
  • Anthropic doesn’t offer an embedding model; you need a separate provider.
  • Google Gemini embeddings — competitive, multilingual.

Open-weights options

  • bge-large-en-v1.5 (BAAI) — strong English baseline.
  • bge-m3 — multilingual, multi-functional (dense, sparse, multi-vector).
  • mxbai-embed-large-v1 — strong open model.
  • gte-large-en-v1.5 — Alibaba, strong general performance.
  • nomic-embed-text-v2-moe — MoE-based open model.
  • jina-embeddings-v3 — long-context, late-chunking-friendly.
  • stella_en_1.5B_v5 — competitive 1.5B param model.
  • NV-Embed-v2 — NVIDIA’s decoder-based embedder, competitive with closed.
  • Qwen3-Embedding family — newer Qwen-based embedders.

Specialized embedders

  • Code: voyage-code-3, nomic-embed-code, CodeRankEmbed
  • Multilingual: multilingual-e5-large-instruct, bge-m3, jina-embeddings-v3
  • Long context: jina-embeddings-v3 (8k+), nomic-embed-text-v2 (long-doc)
  • Multimodal (text + image): CLIP, SigLIP, voyage-multimodal-3, OpenCLIP
  • Legal: LegalBERT-derived models, custom fine-tuned
  • Biomedical: BioBERT-derived, MedEmbed
  • Finance: FinBERT-derived, finance-embeddings-investopedia

For domain-specific apps, a domain-tuned embedder usually beats the latest general SOTA.

Decoder-based embedders

A 2024 trend: take a pretrained decoder LLM and adapt it for embedding, either by fine-tuning it (often only the last layers) or by simply pooling its hidden states into a single vector. This surprisingly often beats encoder-only embedders.

Examples: NV-Embed-v2, GritLM, E5-Mistral, Qwen3-Embedding.

Why it works: decoders are bigger (more capacity), trained on more diverse data, and instruction-following naturally extends to “embed this for retrieval.”
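
A sketch of the simplest version: masked mean-pooling over a decoder's hidden states with Hugging Face transformers. The model name is a stand-in, and production decoder embedders (NV-Embed, E5-Mistral, and friends) add instruction prefixes and their own pooling schemes.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL = "mistralai/Mistral-7B-v0.1"  # stand-in; any decoder LM works the same way

tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:              # decoder tokenizers often lack a pad token
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModel.from_pretrained(MODEL)

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (batch, seq, dim)
    mask = batch["attention_mask"].unsqueeze(-1)             # zero out padding positions
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # masked mean over tokens
    return F.normalize(pooled, dim=-1)
```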

Matryoshka embeddings (MRL)

Trained so that any prefix of the vector is a valid (lower-dim) embedding.

A 1024-dim Matryoshka embedding lets you:

  • Use 64 dims for fast first-stage filtering.
  • Use 256 dims for re-scoring.
  • Use full 1024 for final ranking.

Modern OpenAI, Voyage, and several open models support this. Lets you tune storage vs quality at retrieval time, not training time.
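
A sketch of that coarse-to-fine pattern using client-side truncation; with OpenAI's text-embedding-3 models you can alternatively request a smaller width via the `dimensions` parameter. The vectors here are random stand-ins for real Matryoshka embeddings.

```python
import numpy as np

def truncate(vectors: np.ndarray, dims: int) -> np.ndarray:
    """Keep the first `dims` components and renormalize so dot products stay comparable."""
    v = vectors[:, :dims]
    return v / np.linalg.norm(v, axis=1, keepdims=True)

corpus = np.random.randn(10_000, 1024).astype(np.float32)  # stand-in for real MRL embeddings
query = np.random.randn(1, 1024).astype(np.float32)

# Stage 1: cheap filter at 64 dims, keep the top 100 candidates.
coarse = truncate(corpus, 64) @ truncate(query, 64).T
candidates = np.argsort(-coarse[:, 0])[:100]

# Stage 2: re-score only those candidates at the full 1024 dims.
fine = truncate(corpus[candidates], 1024) @ truncate(query, 1024).T
ranked = candidates[np.argsort(-fine[:, 0])]
```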

Multi-vector retrieval (ColBERT-style)

Instead of one vector per chunk, produce many — one per token. Score a query against a passage by matching each query token to its best-matching passage token and summing those maxima (MaxSim); a sketch follows the list below.

  • More accurate (especially for long passages).
  • More expensive (storage and compute).
  • Used in ColBERT, JaColBERT, and as late-interaction re-rankers.
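
A MaxSim scoring sketch, assuming per-token embeddings have already been produced and L2-normalized:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, passage_tokens: np.ndarray) -> float:
    """query_tokens: (Tq, dim), passage_tokens: (Tp, dim); rows are L2-normalized."""
    sims = query_tokens @ passage_tokens.T   # (Tq, Tp) token-level similarities
    return float(sims.max(axis=1).sum())     # best passage token per query token, summed

# Random stand-ins for per-token embeddings from a ColBERT-style encoder.
rng = np.random.default_rng(0)
q = rng.standard_normal((8, 128));   q /= np.linalg.norm(q, axis=1, keepdims=True)
p = rng.standard_normal((200, 128)); p /= np.linalg.norm(p, axis=1, keepdims=True)
print(maxsim_score(q, p))
```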

For most apps, a single dense vector + a cross-encoder re-ranker beats raw multi-vector retrieval on a cost basis.

Sparse + dense (hybrid)

We cover this in hybrid-search-and-reranking.md. Briefly: combine BM25 (sparse keyword matching) with dense embeddings. Often beats either alone.

bge-m3 produces dense, sparse, and multi-vector outputs from one model; SPLADE produces learned sparse vectors that pair well with any dense embedder.
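
One common way to merge a BM25 ranking with a dense ranking is reciprocal rank fusion; a minimal sketch (the constant k=60 is conventional), with the details left to hybrid-search-and-reranking.md:

```python
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into a single ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc9", "doc3"]
print(rrf([bm25_hits, dense_hits]))  # doc1 and doc3 rise to the top
```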

Choosing for your use case

  • English RAG, general: text-embedding-3-large or voyage-3-large
  • Open-source preferred: bge-large-en-v1.5 or bge-m3
  • Code search: voyage-code-3
  • Multilingual (Asia, Europe): bge-m3 or voyage-multilingual-2
  • Long context per chunk (>512 tokens): jina-embeddings-v3, or chunk smaller and use OpenAI embeddings
  • Latency-critical: text-embedding-3-small with MRL → 256 dims
  • Budget-constrained: text-embedding-3-small
  • Multimodal (text + image): voyage-multimodal-3 or CLIP/SigLIP

Evaluating an embedder for your data

  1. Build a small gold set: 50–200 (query, relevant_doc) pairs from your domain.
  2. For each query, retrieve top-k from your corpus.
  3. Measure recall@k — fraction of queries where the relevant doc is in top-k.
  4. Compare across embedders.

Tools: BEIR-style eval, mteb library, custom scripts.
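
A custom-script version of that loop might look like this; `embed`, `doc_ids`, and `doc_vectors` are stand-ins for the embedder under test and your already-embedded corpus.

```python
import numpy as np

def recall_at_k(gold_pairs, embed, doc_ids, doc_vectors, k=10):
    """gold_pairs: list of (query_text, relevant_doc_id).
    embed: maps a list of strings to L2-normalized vectors of shape (n, dim).
    doc_ids / doc_vectors: the corpus, already embedded with the same model."""
    hits = 0
    for query, relevant_id in gold_pairs:
        scores = (embed([query]) @ doc_vectors.T)[0]
        top_k = [doc_ids[i] for i in np.argsort(-scores)[:k]]
        hits += relevant_id in top_k
    return hits / len(gold_pairs)

# Re-embed the corpus with each candidate embedder and compare recall_at_k side by side.
```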

Don’t trust generic leaderboards alone. Your data is unique.

Domain adaptation

If a generic embedder underperforms on your data:

  1. Fine-tune with contrastive loss on (query, positive) pairs from your domain. See Stage 10 — Embedding fine-tuning.
  2. Use synthetic pairs: have an LLM generate questions for each document, fine-tune.
  3. Use hard negatives: mine confusable but irrelevant docs to push apart.

Even a few thousand domain pairs can give a 5–15-point recall jump.
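
A fine-tuning sketch using the sentence-transformers fit API with MultipleNegativesRankingLoss, which supplies in-batch negatives automatically; `pairs` stands in for your own (query, positive passage) data, and the base model is just an example.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Your own domain data: (query, positive passage) string pairs.
pairs = [("how do I rotate an API key?",
          "To rotate an API key, open the dashboard, revoke the old key, and issue a new one.")]

model = SentenceTransformer("BAAI/bge-base-en-v1.5")   # example base model
examples = [InputExample(texts=[query, positive]) for query, positive in pairs]
loader = DataLoader(examples, shuffle=True, batch_size=32)
# In-batch negatives come for free; append a mined hard negative as a third text per example.
loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=100)
model.save("bge-base-domain-tuned")
```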

Operational concerns

  • Vector dimension drives storage cost. 1024-dim vs 1536-dim matters at billion-scale.
  • Embedding model upgrades = re-embed everything. Plan for it; version your indexes.
  • Latency: embedding APIs add 50–200ms per query. Cache aggressively.
  • Cost: at scale, embedding costs add up. Smaller dims, batch embedding, MRL truncation help.
  • Quantization: float16 halves vector storage and int8 cuts it 4× relative to float32, usually with minimal quality loss.
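
For illustration, a per-vector int8 scalar quantization sketch; many vector databases implement a variant of this internally.

```python
import numpy as np

def quantize_int8(vectors: np.ndarray):
    """Per-vector symmetric scalar quantization: int8 codes plus one float32 scale per vector."""
    scale = np.abs(vectors).max(axis=1, keepdims=True) / 127.0
    codes = np.round(vectors / scale).astype(np.int8)
    return codes, scale.astype(np.float32)

def dequantize(codes: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scale

vecs = np.random.randn(1000, 1024).astype(np.float32)
codes, scale = quantize_int8(vecs)
print(np.abs(vecs - dequantize(codes, scale)).max())  # reconstruction error stays small
```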

Pitfalls

  • Mixing models: vectors from text-embedding-3-small and bge-large are not comparable. Don’t mix.
  • Forgetting to use the right “mode”: many models distinguish query vs passage encoding. Check the docs.
  • Ignoring instructions: instruction-tuned embedders work better when given the task description (“Represent this query for retrieval…”).
  • Stale embeddings: source docs updated, embeddings not refreshed. Build re-indexing into your pipeline.

See also