Hybrid Search & Reranking
Dense embeddings are great at semantic matching. Keyword search (BM25) is great at exact matches. Combining them — and adding a reranker on top — is the dominant pattern for high-quality production retrieval.
Why hybrid
Dense embeddings can fail when:
- The query is mostly proper nouns or acronyms (a PII field name, a product code, a person’s name).
- The query is short and carries little semantic content (e.g., three bare keywords).
- Exact term matching matters: the relevant passage differs from near-misses only by a specific token (an error code, a config key) that embeddings tend to blur together.
Keyword search (BM25) fails when:
- The query asks something semantic and the relevant passage uses synonyms or paraphrases.
- The query is in natural language and the docs are short and structured.
Together, they cover each other’s blind spots.
BM25 (keyword search)
The classic information-retrieval scoring function. Roughly: how many query terms appear in the document, weighted by:
- TF (term frequency in doc) — saturating.
- IDF (rarity across corpus) — rarer terms count more.
- Document length normalization.
BM25(d, q) = Σ_t IDF(t) · (TF(t,d)·(k₁+1)) / (TF(t,d) + k₁·(1−b + b·|d|/avgdl))
Default parameters (k₁=1.2, b=0.75) work well across domains. BM25 is fast, deterministic, language-agnostic in spirit (with proper tokenization).
Available in: Elasticsearch, OpenSearch, Vespa, Postgres FTS, rank_bm25 Python lib, Tantivy/Lucene.
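To make the formula concrete, here is a minimal sketch using the rank_bm25 library listed above; the tiny corpus and whitespace tokenization are only illustrative, and k1/b are set explicitly to the values quoted above.

from rank_bm25 import BM25Okapi

corpus = [
    "Set DATABASE_URL before starting the server.",
    "The menu lists several vegetarian options.",
    "BM25 weights rare terms more heavily than common ones.",
]
# Naive whitespace tokenization for illustration; use a real tokenizer in practice.
tokenized = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized, k1=1.2, b=0.75)

query = "database_url setup".split()
scores = bm25.get_scores(query)            # one BM25 score per document
top = bm25.get_top_n(query, corpus, n=2)   # the two highest-scoring documents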
Combining dense and sparse
Three common strategies.
Reciprocal Rank Fusion (RRF)
Simple, robust. For each result, sum 1 / (k + rank_i) across all retrievers (k ≈ 60).
def rrf(results_lists, k=60):
    """Fuse several ranked lists of doc IDs; each list is ordered best-first."""
    scores = {}
    for results in results_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
No hyperparameter tuning. Strong baseline.
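For instance, fusing two ranked ID lists with the function above (the IDs are hypothetical):

dense_ids = ["d7", "d2", "d9", "d4"]    # best-first output of vector search
sparse_ids = ["d2", "d5", "d7", "d1"]   # best-first output of BM25
fused = rrf([dense_ids, sparse_ids])
# d2 and d7 appear in both lists, so they rise to the top of the fused ranking.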
Linear combination
Weighted sum of normalized scores:
final = α · dense_score + (1−α) · bm25_score
Requires score normalization (different scales) and tuning α. Sometimes outperforms RRF if you tune well.
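A minimal sketch, assuming min-max normalization per retriever and a hand-tuned α; the function and variable names are illustrative:

def minmax(scores):
    # scores: dict of doc_id -> raw score from one retriever
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0
    return {d: (s - lo) / span for d, s in scores.items()}

def linear_fusion(dense_scores, bm25_scores, alpha=0.7):
    dense_n, bm25_n = minmax(dense_scores), minmax(bm25_scores)
    docs = set(dense_n) | set(bm25_n)
    final = {d: alpha * dense_n.get(d, 0.0) + (1 - alpha) * bm25_n.get(d, 0.0) for d in docs}
    return sorted(final, key=final.get, reverse=True)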
Learned fusion
Train a small model on (query, candidate_docs, relevance) to predict relevance from both scores plus other features. The most accurate but the most complex; mostly used at large scale.
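As a rough sketch of the idea, a logistic regression over per-candidate features; the feature choice and every number below are invented for illustration:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per (query, candidate) pair: [dense_score, bm25_score, doc_length].
X_train = np.array([[0.82, 7.1, 120], [0.40, 1.3, 80], [0.75, 0.2, 300], [0.35, 6.8, 90]])
y_train = np.array([1, 0, 1, 1])        # human relevance labels

fusion = LogisticRegression().fit(X_train, y_train)

# At query time: score every candidate, sort by predicted probability of relevance.
candidates = np.array([[0.70, 4.2, 150], [0.55, 0.8, 60]])
relevance = fusion.predict_proba(candidates)[:, 1]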
Native hybrid search
Some DBs have it built in:
- Weaviate: `near_text` + `bm25` with alpha blending.
- Qdrant: hybrid queries with payload-based BM25 + vectors.
- Vespa: highly customizable rank profiles mixing many signals.
- Elasticsearch / OpenSearch: `kNN` queries combined with text queries via boolean composition.
- pgvector + Postgres FTS: combine in SQL.
For others, you run two queries and merge in your application.
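For the do-it-yourself case, a sketch that reuses the rrf() function from above; dense_search and bm25_search stand in for whatever client calls your stack provides:

def hybrid_search(query, k=10):
    dense_ids = dense_search(query, k=50)     # vector DB query (placeholder)
    sparse_ids = bm25_search(query, k=50)     # keyword index query (placeholder)
    return rrf([dense_ids, sparse_ids])[:k]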
Reranking
After your initial retrieval (top-100 or so), use a more expensive model to re-score the candidates.
Why rerank
Dense retrieval is fast but coarse. A cross-encoder (which sees query + doc together) is slow but accurate. Run dense on millions, cross-encode on top-100, return top-10. This is the bi-encoder + cross-encoder pattern.
Cross-encoders
Take query + document concatenated; output a single relevance score:
[CLS] query [SEP] document → relevance ∈ [0, 1]
Available models:
- Cohere `rerank-3.5` — strong, multilingual, easy API.
- `bge-reranker-v2-m3` — open, multilingual.
- `mxbai-rerank-large-v1` — open, strong English.
- `jina-reranker-v2` — long-context.
- Voyage `rerank-2` — high-quality.
API call latency: 50–500ms to rerank 100 candidates.
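A minimal sketch, assuming the sentence-transformers CrossEncoder wrapper and the open bge-reranker-v2-m3 model listed above; the query and candidate texts are invented:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

query = "How do I rotate my API key?"
candidates = [
    "To rotate an API key, create a new key in Settings and revoke the old one.",
    "Keyboard shortcuts can be customized under Preferences.",
]

# Each (query, doc) pair is scored jointly; higher means more relevant.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = [doc for _, doc in sorted(zip(scores, candidates), key=lambda p: p[0], reverse=True)]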
LLM-as-reranker
Use a cheap LLM (Haiku, Gemini Flash) to score each candidate. Slower than dedicated rerankers but more flexible:
Score the relevance of this passage to the query on a 0-10 scale.
Query: ...
Passage: ...
Usable for niche domains where dedicated rerankers underperform.
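A pointwise sketch of this approach; call_llm is a placeholder for whatever LLM client you use, and the number parsing is deliberately defensive:

import re

PROMPT = """Score the relevance of this passage to the query on a 0-10 scale.
Reply with a single number.

Query: {query}
Passage: {passage}"""

def llm_rerank(query, passages, call_llm):
    scored = []
    for passage in passages:
        reply = call_llm(PROMPT.format(query=query, passage=passage))
        match = re.search(r"\d+(\.\d+)?", reply)   # pull the first number out of the reply
        scored.append((float(match.group()) if match else 0.0, passage))
    return [p for _, p in sorted(scored, key=lambda x: x[0], reverse=True)]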
Listwise reranking
Show the model all candidates at once and ask for a ranking:
Rank these passages from most to least relevant to the query.
1. ...
2. ...
3. ...
GPT-4-class models can do this well. Expensive but sometimes the best option for top-quality retrieval.
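A sketch of building that prompt and parsing the returned order; call_llm is again a placeholder, and the comma-separated reply format is an assumption you would enforce in your own prompt:

def listwise_rerank(query, passages, call_llm):
    numbered = "\n".join(f"{i + 1}. {p}" for i, p in enumerate(passages))
    prompt = (
        "Rank these passages from most to least relevant to the query.\n"
        f"Query: {query}\n{numbered}\n"
        "Reply with the passage numbers in order, comma-separated."
    )
    reply = call_llm(prompt)
    order = [int(tok) for tok in reply.replace(",", " ").split() if tok.isdigit()]
    return [passages[i - 1] for i in order if 1 <= i <= len(passages)]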
Putting it together — production retrieval pipeline
def retrieve(query, k=10):
    # 1. Query transformation (optional)
    rewritten = rewrite_query(query)

    # 2. Hybrid retrieval — get a wide net
    dense = dense_search(rewritten, k=50)
    sparse = bm25_search(rewritten, k=50)
    candidates = rrf([dense, sparse])[:50]

    # 3. Apply metadata filters
    candidates = filter_by_metadata(candidates, user_filters)

    # 4. Rerank with cross-encoder
    reranked = rerank(query, candidates, top_k=k)

    # 5. Return with citations
    return reranked
This pipeline is the production default for high-quality RAG.
Score thresholds
A relevance score of 0.7 means different things across models. Don’t hardcode thresholds — calibrate.
- For each query in your eval set, get the score of the best relevant doc.
- Find a threshold that captures most relevant docs while filtering out most irrelevant ones.
- Set conservatively; let the LLM say “I don’t know” rather than answer from low-relevance retrieval.
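A sketch of that calibration, assuming you have already collected two score lists from your eval runs: the best relevant doc's score per query, and scores of known-irrelevant docs.

import numpy as np

def calibrate_threshold(relevant_scores, irrelevant_scores, target_recall=0.95):
    # Cutoff that keeps target_recall of the relevant scores above it...
    threshold = float(np.quantile(relevant_scores, 1.0 - target_recall))
    # ...and the fraction of irrelevant docs that would still pass that cutoff.
    leak_rate = float(np.mean(np.asarray(irrelevant_scores) >= threshold))
    return threshold, leak_rate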
Multilingual considerations
- BM25 works in any language with proper tokenization (don’t apply English stemming to French!).
- Dense embedders vary in multilingual quality — pick a multilingual model.
- Hybrid search matters even more in multilingual settings: exact keyword matching keeps working regardless of how well the embedding model covers a given language.
Cost and latency tradeoffs
Adding re-ranking:
- 50–500ms extra latency.
- ~$0.001–$0.01 per query.
- Typically 5–15 percentage points of improvement in recall@k.
Adding hybrid:
- Some extra storage (BM25 index).
- Negligible extra query latency if running in parallel.
- 5–15 point gains on certain query types (acronyms, codes, exact-match lookups).
For most production RAG: dense + rerank is the highest-leverage combo. Add BM25 if you have many keyword-heavy queries or proper-noun-heavy data.
Pitfalls
- Tokenization mismatch between BM25 and your text — common when languages or formatting differ.
- Reranking the top 100 isn't enough if the initial retrieval misses relevant docs entirely; hybrid search widens the candidate pool and helps here.
- Reranker context limits — long passages get truncated. Use the right reranker for your chunk size.
- Score normalization done wrong in linear combination — RRF avoids this.
- Reranking everything — if you have 10M candidates, rerank the top 100, not all of them. Recall first, precision second.
Watch it interactively
- RAG Visualizer — same query, three strategies (dense / BM25 / hybrid), three rankings side-by-side on a real corpus. Predict before clicking: for “vegetarian options” dense wins (paraphrase); for “DATABASE_URL” BM25 wins (exact-match keyword); hybrid never loses. The score columns show why.
- Reranker Lab — bi-encoder vs cross-encoder rerank. Watch the rank shuffle live.
Build it in code
- /ship/08 — BM25 + dense + rerank — the full 3-stage pipeline in ~150 lines. Includes RRF + cross-encoder rerank.
- /case-studies/01 — docs assistant — the same pipeline applied to a real product, with citation-quality eval numbers.