Hybrid Search & Reranking
Dense embeddings are great at semantic matching. Keyword search (BM25) is great at exact matches. Combining them — and adding a reranker on top — is the dominant pattern for high-quality production retrieval.
Why hybrid
Dense embeddings can fail when:
- The query is mostly proper nouns or acronyms (a PII field name, a product code, a person’s name).
- The query is short and carries little semantic content (e.g., three bare keywords).
- Exact term matching matters: the relevant passage differs from near-misses only by a specific token (an error code, a config key) that embeddings tend to blur together.
Keyword search (BM25) fails when:
- The query asks something semantic and the relevant passage uses synonyms or paraphrases.
- The query is in natural language and the docs are short and structured.
Together, they cover each other’s blind spots.
BM25 (keyword search)
The classic information-retrieval scoring function. Roughly: how many query terms appear in the document, weighted by:
- TF (term frequency in doc) — saturating.
- IDF (rarity across corpus) — rarer terms count more.
- Document length normalization.
BM25(d, q) = Σ_t IDF(t) · (TF(t,d)·(k₁+1)) / (TF(t,d) + k₁·(1−b + b·|d|/avgdl))
Default parameters (k₁=1.2, b=0.75) work well across domains. BM25 is fast, deterministic, language-agnostic in spirit (with proper tokenization).
Available in: Elasticsearch, OpenSearch, Vespa, Postgres FTS, rank_bm25 Python lib, Tantivy/Lucene.
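To make the formula concrete, here is a minimal sketch using the rank_bm25 library listed above; the tiny corpus and whitespace tokenization are only illustrative, and k1/b are set explicitly to the values quoted above.

from rank_bm25 import BM25Okapi

corpus = [
    "Set DATABASE_URL before starting the server.",
    "The menu lists several vegetarian options.",
    "BM25 weights rare terms more heavily than common ones.",
]
# Naive whitespace tokenization for illustration; use a real tokenizer in practice.
tokenized = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized, k1=1.2, b=0.75)

query = "database_url setup".split()
scores = bm25.get_scores(query)            # one BM25 score per document
top = bm25.get_top_n(query, corpus, n=2)   # the two highest-scoring documents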
Combining dense and sparse
Three common strategies.
Reciprocal Rank Fusion (RRF)
Simple, robust. For each result, sum 1 / (k + rank_i) across all retrievers (k ≈ 60).
def rrf(results_lists, k=60):
    """Fuse several ranked lists of doc IDs; each list is ordered best-first."""
    scores = {}
    for results in results_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
No hyperparameter tuning. Strong baseline.
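For instance, fusing two ranked ID lists with the function above (the IDs are hypothetical):

dense_ids = ["d7", "d2", "d9", "d4"]    # best-first output of vector search
sparse_ids = ["d2", "d5", "d7", "d1"]   # best-first output of BM25
fused = rrf([dense_ids, sparse_ids])
# d2 and d7 appear in both lists, so they rise to the top of the fused ranking.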
Linear combination
Weighted sum of normalized scores:
final = α · dense_score + (1−α) · bm25_score
Requires score normalization (different scales) and tuning α. Sometimes outperforms RRF if you tune well.
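A minimal sketch, assuming min-max normalization per retriever and a hand-tuned α; the function and variable names are illustrative:

def minmax(scores):
    # scores: dict of doc_id -> raw score from one retriever
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def linear_fusion(dense_scores, bm25_scores, alpha=0.7):
    dense_n, bm25_n = minmax(dense_scores), minmax(bm25_scores)
    docs = set(dense_n) | set(bm25_n)
    final = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0) for d in docs}
    return sorted(final, key=final.get, reverse=True)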
Learned fusion
Train a small model on (query, candidate_docs, relevance) to predict relevance from both scores plus other features. The most accurate but the most complex; mostly used at large scale.
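As a rough sketch of the idea, a logistic regression over per-candidate features; the feature choice and every number below are invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, candidate) pair: [dense_score, bm25_score, doc_length].
X_train = np.array([[0.82, 7.1, 120], [0.40, 1.3, 80], [0.75, 0.2, 300], [0.35, 6.8, 90]])
y_train = np.array([1, 0, 1, 1])        # human relevance labels

fusion = LogisticRegression().fit(X_train, y_train)

# At query time: score every candidate, sort by predicted probability of relevance.
candidates = np.array([[0.70, 4.2, 150], [0.55, 0.8, 60]])
relevance = fusion.predict_proba(candidates)[:, 1]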
Native hybrid search
Some DBs have it built in:
- Weaviate: `near_text` + `bm25` with alpha blending.
- Qdrant: hybrid queries with payload-based BM25 + vectors.
- Vespa: highly customizable rank profiles mixing many signals.
- Elasticsearch / OpenSearch: `kNN` queries combined with text queries via boolean composition.
- pgvector + Postgres FTS: combine in SQL.
For others, you run two queries and merge in your application.
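For the do-it-yourself case, a sketch that reuses the rrf() function from above; dense_search and bm25_search stand in for whatever client calls your stack provides:

def hybrid_search(query, k=10):
    dense_ids = dense_search(query, k=50)     # vector DB query (placeholder)
    sparse_ids = bm25_search(query, k=50)     # keyword index query (placeholder)
    return rrf([dense_ids, sparse_ids])[:k]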
Reranking
After your initial retrieval (top-100 or so), use a more expensive model to re-score the candidates.
Why rerank
Dense retrieval is fast but coarse. A cross-encoder (which sees query + doc together) is slow but accurate. Run dense on millions, cross-encode on top-100, return top-10. This is the bi-encoder + cross-encoder pattern.
Cross-encoders
Take query + document concatenated; output a single relevance score:
[CLS] query [SEP] document → relevance ∈ [0, 1]
Available models:
- Cohere `rerank-3.5` — strong, multilingual, easy API.
- `bge-reranker-v2-m3` — open, multilingual.
- `mxbai-rerank-large-v1` — open, strong English.
- `jina-reranker-v2` — long-context.
- Voyage `rerank-2` — high-quality.
API call latency: 50–500ms to rerank 100 candidates.
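A minimal sketch, assuming the sentence-transformers CrossEncoder wrapper and the open bge-reranker-v2-m3 model listed above; the query and candidate texts are invented:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "How do I rotate my API key?"
candidates = [
    "To rotate an API key, create a new key in Settings and revoke the old one.",
    "Keyboard shortcuts can be customized under Preferences.",
]

# Each (query, doc) pair is scored jointly; higher means more relevant.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]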
LLM-as-reranker
Use a cheap LLM (Haiku, Gemini Flash) to score each candidate. Slower than dedicated rerankers but more flexible:
Score the relevance of this passage to the query on a 0-10 scale.
Query: ...
Passage: ...
Usable for niche domains where dedicated rerankers underperform.
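A pointwise sketch of this approach; call_llm is a placeholder for whatever LLM client you use, and the number parsing is deliberately defensive:

import re

PROMPT = """Score the relevance of this passage to the query on a 0-10 scale.
Reply with a single number.

Query: {query}
Passage: {passage}"""

def llm_rerank(query, passages, call_llm):
    scored = []
    for passage in passages:
        reply = call_llm(PROMPT.format(query=query, passage=passage))
        match = re.search(r"\d+(\.\d+)?", reply)   # pull the first number out of the reply
        scored.append((float(match.group()) if match else 0.0, passage))
    return [p for _, p in sorted(scored, key=lambda x: x[0], reverse=True)]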
Listwise reranking
Show the model all candidates at once and ask for a ranking:
Rank these passages from most to least relevant to the query.
1. ...
2. ...
3. ...
GPT-4-class models can do this well. Expensive but sometimes the best option for top-quality retrieval.
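A sketch of building that prompt and parsing the returned order; call_llm is again a placeholder, and the comma-separated reply format is an assumption you would enforce in your own prompt:

def listwise_rerank(query, passages, call_llm):
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank these passages from most to least relevant to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Reply with the passage numbers in order, comma-separated."
    )
    reply = call_llm(prompt)
    order = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    return [passages[i - 1] for i in order if 1 <= i <= len(passages)]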
Putting it together — production retrieval pipeline
def retrieve(query, k=10):
    # 1. Query transformation (optional)
    rewritten = rewrite_query(query)

    # 2. Hybrid retrieval — get a wide net
    dense = dense_search(rewritten, k=50)
    sparse = bm25_search(rewritten, k=50)
    candidates = rrf([dense, sparse])[:50]

    # 3. Apply metadata filters
    candidates = filter_by_metadata(candidates, user_filters)

    # 4. Rerank with cross-encoder
    reranked = rerank(query, candidates, top_k=k)

    # 5. Return with citations
    return reranked
This pipeline is the production default for high-quality RAG.
Score thresholds
A relevance score of 0.7 means different things across models. Don’t hardcode thresholds — calibrate.
- For each query in your eval set, get the score of the best relevant doc.
- Find a threshold that captures most relevant docs while filtering out most irrelevant ones.
- Set conservatively; let the LLM say “I don’t know” rather than answer from low-relevance retrieval.
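A sketch of that calibration, assuming you have already collected two score lists from your eval runs: the best relevant doc's score per query, and scores of known-irrelevant docs.

import numpy as np

def calibrate_threshold(relevant_scores, irrelevant_scores, target_recall=0.95):
    # Cutoff that keeps target_recall of the relevant scores above it...
    threshold = float(np.quantile(relevant_scores, 1.0 - target_recall))
    # ...and the fraction of irrelevant docs that would still pass that cutoff.
    leak_rate = float(np.mean(np.asarray(irrelevant_scores) >= threshold))
    return threshold, leak_rate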
Multilingual considerations
- BM25 works in any language with proper tokenization (don’t apply English stemming to French!).
- Dense embedders vary in multilingual quality — pick a multilingual model.
- Hybrid search matters even more in multilingual settings: exact keyword matching keeps working regardless of how well the embedding model covers a given language.
Cost and latency tradeoffs
Adding re-ranking:
- 50–500ms extra latency.
- ~$0.001–$0.01 per query.
- Typically 5–15 percentage points of improvement in recall@k.
Adding hybrid:
- Some extra storage (BM25 index).
- Negligible extra query latency if running in parallel.
- 5–15 point gains on certain query types (acronyms, codes, exact-match lookups).
For most production RAG: dense + rerank is the highest-leverage combo. Add BM25 if you have many keyword-heavy queries or proper-noun-heavy data.
Pitfalls
- Tokenization mismatch between BM25 and your text — common when languages or formatting differ.
- Reranking the top 100 isn't enough if the initial retrieval misses relevant docs entirely; hybrid search widens the candidate pool and helps here.
- Reranker context limits — long passages get truncated. Use the right reranker for your chunk size.
- Score normalization done wrong in linear combination — RRF avoids this.
- Reranking everything — if you have 10M candidates, rerank the top 100, not all of them. Recall first, precision second.
Watch it interactively
- RAG Visualizer — same query, three strategies (dense / BM25 / hybrid), three rankings side-by-side on a real corpus. Predict before clicking: for “vegetarian options” dense wins (paraphrase); for “DATABASE_URL” BM25 wins (exact-match keyword); hybrid never loses. The score columns show why.
- Reranker Lab — bi-encoder vs cross-encoder rerank. Watch the rank shuffle live.
Build it in code
- /ship/08 — BM25 + dense + rerank — the full 3-stage pipeline in ~150 lines. Includes RRF + cross-encoder rerank.
- /case-studies/01 — docs assistant — the same pipeline applied to a real product, with citation-quality eval numbers.