Hybrid Search & Reranking

Dense embeddings are great at semantic matching. Keyword search (BM25) is great at exact matches. Combining them — and adding a reranker on top — is the dominant pattern for high-quality production retrieval.

Why hybrid

Dense embeddings can fail when:

  • The query is mostly proper nouns or acronyms (a PII field name, a product code, a person’s name).
  • The query is short and lacks semantic content (e.g., three bare keywords).
  • Exact term matches matter (error codes, config keys, version strings); embeddings smooth over surface form and can miss them.

Keyword search (BM25) fails when:

  • The query asks something semantic and the relevant passage uses synonyms or paraphrases.
  • The query is phrased in natural language while the docs are short and structured, so few query terms appear verbatim.

Together, they cover each other’s blind spots.

BM25

The classic information-retrieval scoring function. Roughly: how many query terms appear in the document, weighted by:

  • TF (term frequency in doc) — saturating.
  • IDF (rarity across corpus) — rarer terms count more.
  • Document length normalization.

BM25(d, q) = Σ_t IDF(t) · (TF(t,d)·(k₁+1)) / (TF(t,d) + k₁·(1−b + b·|d|/avgdl))

Default parameters (k₁=1.2, b=0.75) work well across domains. BM25 is fast, deterministic, language-agnostic in spirit (with proper tokenization).

Available in: Elasticsearch, OpenSearch, Vespa, Postgres FTS, rank_bm25 Python lib, Tantivy/Lucene.
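
As a quick sketch, scoring a toy corpus with the rank_bm25 library looks like this (the corpus, query, and whitespace tokenization are illustrative; real systems need proper tokenization):

from rank_bm25 import BM25Okapi

# Toy corpus; in practice these are your chunks.
corpus = [
    "Set DATABASE_URL before starting the server",
    "The restaurant offers several vegetarian options",
    "BM25 ranks documents by term frequency and rarity",
]
tokenized = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized, k1=1.2, b=0.75)  # the default-style parameters above
query = "database_url setup".split()
print(bm25.get_scores(query))                # one score per document; highest wins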

Combining dense and sparse

Three common strategies.

Reciprocal Rank Fusion (RRF)

Simple, robust. For each result, sum 1 / (k + rank_i) across all retrievers (k ≈ 60).

def rrf(results_lists, k=60):
    """Fuse several ranked lists of doc IDs with Reciprocal Rank Fusion."""
    scores = {}
    for results in results_lists:
        for rank, doc_id in enumerate(results, start=1):
            # Each retriever contributes 1 / (k + rank) for every doc it returned.
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

No hyperparameter tuning. Strong baseline.
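
A quick usage sketch with two hypothetical ranked ID lists:

dense = ["doc_a", "doc_b", "doc_c"]   # from the vector index
sparse = ["doc_b", "doc_d", "doc_a"]  # from BM25
print(rrf([dense, sparse]))
# ['doc_b', 'doc_a', 'doc_d', 'doc_c']: docs found by both retrievers float to the top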

Linear combination

Weighted sum of normalized scores:

final = α · dense_score + (1−α) · bm25_score

Requires score normalization (different scales) and tuning α. Sometimes outperforms RRF if you tune well.
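
A minimal sketch, assuming each retriever returns a dict of doc_id to raw score; the min-max normalization and α=0.7 are illustrative choices:

def normalize(scores):
    """Min-max normalize a {doc_id: score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / ((hi - lo) or 1) for d, s in scores.items()}

def linear_fusion(dense_scores, bm25_scores, alpha=0.7):
    dense, sparse = normalize(dense_scores), normalize(bm25_scores)
    docs = set(dense) | set(sparse)
    fused = {d: alpha * dense.get(d, 0) + (1 - alpha) * sparse.get(d, 0) for d in docs}
    return sorted(fused, key=fused.get, reverse=True)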

Learned fusion

Train a small model on (query, candidate_docs, relevance) to predict relevance from both scores plus other features. The most accurate but the most complex; mostly used at large scale.
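
A minimal sketch with scikit-learn's LogisticRegression, using just the two retrieval scores as features; the numbers and labels are made up, and real setups use many more features and far more training data:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, candidate_doc) pair: [dense_score, bm25_score],
# labeled 1 if the doc was judged relevant to the query.
X = np.array([[0.82, 3.1], [0.40, 7.9], [0.35, 0.2], [0.75, 5.5]])
y = np.array([1, 1, 0, 1])

fusion = LogisticRegression().fit(X, y)
# At query time, rank candidates by predicted probability of relevance.
print(fusion.predict_proba([[0.60, 4.0]])[:, 1])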

Some DBs have it built in:

  • Weaviate: near_text + bm25 with alpha blending.
  • Qdrant: hybrid queries with payload-based BM25 + vectors.
  • Vespa: highly customizable rank profiles mixing many signals.
  • Elasticsearch / OpenSearch: kNN queries combined with text queries via boolean composition.
  • pgvector + Postgres FTS: combine in SQL.

For others, you run two queries and merge in your application.

Reranking

After your initial retrieval (top-100 or so), use a more expensive model to re-score the candidates.

Why rerank

Dense retrieval is fast but coarse. A cross-encoder (which sees query + doc together) is slow but accurate. Run dense on millions, cross-encode on top-100, return top-10. This is the bi-encoder + cross-encoder pattern.

Cross-encoders

Take query + document concatenated; output a single relevance score:

[CLS] query [SEP] document → relevance ∈ [0, 1]

Available models:

  • Cohere rerank-3.5 — strong, multilingual, easy API.
  • bge-reranker-v2-m3 — open, multilingual.
  • mxbai-rerank-large-v1 — open, strong English.
  • jina-reranker-v2 — long-context.
  • Voyage rerank-2 — high-quality.

API call latency: 50–500 ms to rerank 100 candidates.
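
A minimal local sketch using the sentence-transformers CrossEncoder class; the model name (one of the open rerankers above) and the candidates are illustrative, and score ranges vary by model:

from sentence_transformers import CrossEncoder

model = CrossEncoder("BAAI/bge-reranker-v2-m3")
query = "how do I rotate my API key?"
candidates = [
    "To rotate a key, generate a new one and revoke the old one.",
    "API rate limits reset every minute.",
]
# The cross-encoder scores each (query, doc) pair jointly.
scores = model.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)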

LLM-as-reranker

Use a cheap LLM (Haiku, Gemini Flash) to score each candidate. Slower than dedicated rerankers but more flexible:

Score the relevance of this passage to the query on a 0-10 scale.
Query: ...
Passage: ...

Usable for niche domains where dedicated rerankers underperform.
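
A pointwise sketch of this; call_llm is a hypothetical helper that sends a prompt to whatever cheap model you use and returns its text reply:

def llm_rerank(query, candidates, top_k=10):
    scored = []
    for passage in candidates:
        prompt = (
            "Score the relevance of this passage to the query on a 0-10 scale. "
            "Reply with only the number.\n"
            f"Query: {query}\nPassage: {passage}"
        )
        reply = call_llm(prompt)  # hypothetical: one Haiku / Gemini Flash call
        try:
            score = float(reply.strip())
        except ValueError:
            score = 0.0           # unparsable reply counts as irrelevant
        scored.append((passage, score))
    scored.sort(key=lambda x: x[1], reverse=True)
    return [p for p, _ in scored[:top_k]]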

Listwise reranking

Show the model all candidates at once and ask for a ranking:

Rank these passages from most to least relevant to the query.
1. ...
2. ...
3. ...

GPT-4-class models can do this well. Expensive but sometimes the best option for top-quality retrieval.
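
A listwise sketch using the same hypothetical call_llm helper; the model sees all candidates at once and replies with an ordering such as "3, 1, 2", so the parsing is deliberately defensive:

def listwise_rerank(query, candidates):
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(candidates))
    prompt = (
        "Rank these passages from most to least relevant to the query. "
        "Reply with the passage numbers in order, comma-separated.\n"
        f"Query: {query}\n{numbered}"
    )
    reply = call_llm(prompt)  # hypothetical LLM call
    order = [int(tok) - 1 for tok in reply.replace(",", " ").split() if tok.isdigit()]
    ranked = [candidates[i] for i in order if 0 <= i < len(candidates)]
    # Keep anything the model forgot to mention, in original order, at the end.
    ranked += [p for p in candidates if p not in ranked]
    return ranked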

Putting it together — production retrieval pipeline

def retrieve(query, k=10):
    # rewrite_query, dense_search, bm25_search, filter_by_metadata, rerank,
    # and user_filters are application-specific; rrf is defined above.

    # 1. Query transformation (optional)
    rewritten = rewrite_query(query)

    # 2. Hybrid retrieval — get a wide net
    dense = dense_search(rewritten, k=50)
    sparse = bm25_search(rewritten, k=50)
    candidates = rrf([dense, sparse])[:50]

    # 3. Apply metadata filters
    candidates = filter_by_metadata(candidates, user_filters)

    # 4. Rerank with cross-encoder
    reranked = rerank(query, candidates, top_k=k)

    # 5. Return the top k, ready to be cited in the answer
    return reranked

This pipeline is the production default for high-quality RAG.

Score thresholds

A relevance score of 0.7 means different things across models. Don’t hardcode thresholds — calibrate (a sketch follows the steps below).

  • For each query in your eval set, get the score of the best relevant doc.
  • Find a threshold that captures most relevant docs while filtering most irrelevant.
  • Set conservatively; let the LLM say “I don’t know” rather than answer from low-relevance retrieval.
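
A minimal calibration sketch; eval_set is a hypothetical list of (query, relevant_docs) pairs and score(query, doc) returns your reranker’s score:

import numpy as np

def calibrate_threshold(eval_set, score, percentile=5):
    # Best-relevant-doc score for each eval query.
    best_relevant = [max(score(q, d) for d in docs) for q, docs in eval_set]
    # A low percentile keeps most relevant docs above the threshold;
    # verify the result against known-irrelevant docs before shipping it.
    return float(np.percentile(best_relevant, percentile))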

Multilingual considerations

  • BM25 works in any language with proper tokenization (don’t apply English stemming to French!).
  • Dense embedders vary in multilingual quality — pick a multilingual model.
  • Hybrid search matters even more multilingually: exact keyword matches keep working even when the embedding model’s quality drops for a given language.

Cost and latency tradeoffs

Adding reranking:

  • 50–500ms extra latency.
  • ~$0.001–$0.01 per query.
  • 5–15 percentage point improvement in recall@k.

Adding hybrid:

  • Some extra storage (BM25 index).
  • Negligible extra query latency if running in parallel.
  • 5–15 points on certain query types (acronyms, codes, exact-match).

For most production RAG: dense + rerank is the highest-leverage combo. Add BM25 if you have many keyword-heavy queries or proper-noun-heavy data.

Pitfalls

  • Tokenization mismatch between your BM25 analyzer and your text — common when languages or formatting differ.
  • Reranking the top 100 can’t recover docs the initial retrieval missed entirely; it only reorders what was retrieved. Hybrid search helps here.
  • Reranker context limits — long passages get truncated. Use the right reranker for your chunk size.
  • Score normalization done wrong in linear combination — RRF avoids this.
  • Rerank everything — if you have 10M candidates, rerank top-100, not all of them. Recall first, precision second.

Watch it interactively

  • RAG Visualizer — same query, three strategies (dense / BM25 / hybrid), three rankings side-by-side on a real corpus. Predict before clicking: for “vegetarian options” dense wins (paraphrase); for “DATABASE_URL” BM25 wins (exact-match keyword); hybrid never loses. The score columns show why.
  • Reranker Lab — bi-encoder vs cross-encoder rerank. Watch the rank shuffle live.
