
Retrieval: BM25 + dense + reranking

The three-stage pipeline that beats any single strategy. Hybrid retrieval, RRF fusion, and a cross-encoder reranker as the silent quality lever.


Dense retrieval (step 07) finds semantic matches. Ask “How do I install PostgreSQL?” and it finds the “Installing the Server” chunk even if “PostgreSQL” doesn’t appear in the chunk text. Magic.

But dense retrieval has a blind spot: rare keywords. Search for “DATABASE_URL” and dense will return chunks about database connection strings in general — not necessarily the chunk that has the literal token “DATABASE_URL.” For a user who’s pasted an error message verbatim, that’s a failure.

The fix is straightforward, and it's what virtually every production RAG system does: run both dense and lexical (BM25) retrieval, combine the rankings, then rerank the top with a cross-encoder. Three stages, each cheap, each compensating for the others' weaknesses. By the end of this step, your retrieval will catch what step 07's dense-only version missed.

The three stages

In order (the first stage fans out into two retrieval arms):

Stage 1, lexical arm — BM25

Keyword matching with TF-IDF-style scoring, refined by term-frequency saturation and document-length normalization. The classic information-retrieval algorithm; what every search engine ran before neural retrieval. Strengths: literal-token matches, rare terms, exact codes/identifiers. Weakness: no semantic understanding (“car” and “vehicle” are unrelated as far as BM25 is concerned).
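
For reference, the textbook Okapi BM25 score of a chunk D for a query Q — f(t, D) is the term's frequency in the chunk, |D| the chunk length in tokens, avgdl the average chunk length; k1 controls term-frequency saturation and b controls length normalization (rank-bm25 ships with the usual defaults, k1 = 1.5 and b = 0.75):

score(D, Q) = Σ_{t in Q} IDF(t) · f(t, D) · (k1 + 1) / (f(t, D) + k1 · (1 − b + b · |D| / avgdl))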

Stage 1, dense arm — cosine

What we built in step 07. Strengths: paraphrase, synonymy, semantic similarity. Weakness: rare keywords get diluted in the embedding space.

Stage 2 — Reciprocal Rank Fusion (RRF)

Combine the two rankings without needing to compare scores. For each candidate c, sum 1 / (k + rank_in_dense) and 1 / (k + rank_in_bm25) (with k=60 per the original paper). The candidate with the highest fused score wins. Doesn’t require either method’s scores to be on the same scale, which is what makes it the standard.
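
A quick worked example with made-up ranks (k = 60): a chunk that both arms rank beats a chunk that only one arm found, even if that one arm put it first.

# Hypothetical ranks, only to show the arithmetic.
k = 60
chunk_a = 1 / (k + 1) + 1 / (k + 5)   # 1st in dense, 5th in BM25 → ≈ 0.0318
chunk_b = 1 / (k + 1)                 # 1st in BM25, absent from dense → ≈ 0.0164
# chunk_a wins: agreement between the two arms outweighs a lone first place.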

Stage 3 — Cross-encoder rerank

Take the top ~20 hits from RRF and re-rank them with a cross-encoder — a model that scores (query, chunk) pairs jointly rather than embedding them separately. Much more accurate than bi-encoder cosine, but it costs a full forward pass per (query, chunk) pair at query time and nothing can be precomputed, so we only run it on the top-K. The “single biggest quality lever in production RAG.”

Setup

uv add rank-bm25

rank-bm25 is a tiny in-memory implementation. For larger corpora you'd want a real index (Tantivy, Meilisearch, or Postgres FTS); at the curriculum's scale, in-memory is fine and keeps the mechanics visible.

We’ll also pull in a cross-encoder model:

# Already have sentence-transformers from step 07
# We'll load a CrossEncoder from the same package.

Open the new file:

# stack/retrieve.py
from __future__ import annotations
import json
import re
from collections import defaultdict
from dataclasses import dataclass
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder

from stack.embed import VectorStore, Embedder, Hit

BM25 retrieval

# stack/retrieve.py (continuing)
class BM25Index:
    """In-memory BM25 over the same chunks indexed in the VectorStore.

    Loaded from the SQLite store on startup; rebuilt if the store changes.
    For larger corpora swap in Tantivy or Postgres FTS — same interface.
    """

    def __init__(self, store: VectorStore):
        self.store = store
        self.chunk_ids: list[int] = []
        self.texts: list[str] = []
        self.metadata: list[dict] = []
        self.bm25: BM25Okapi | None = None
        self._reload()

    def _tokenize(self, text: str) -> list[str]:
        """Cheap whitespace + lowercase tokenization. Production uses a
        real tokenizer; for retrieval over English prose, this is fine."""
        return re.findall(r"\b\w+\b", text.lower())

    def _reload(self) -> None:
        cur = self.store._conn.cursor()
        cur.execute("SELECT id, text, metadata FROM chunks ORDER BY id")
        rows = cur.fetchall()

        self.chunk_ids = [r[0] for r in rows]
        self.texts = [r[1] for r in rows]
        self.metadata = [json.loads(r[2]) for r in rows]

        tokenized = [self._tokenize(t) for t in self.texts]
        if tokenized:
            self.bm25 = BM25Okapi(tokenized)

    def search(self, query: str, k: int = 20) -> list[Hit]:
        """Return top-k chunks by BM25 score for the given query."""
        if self.bm25 is None:
            return []
        tokens = self._tokenize(query)
        scores = self.bm25.get_scores(tokens)
        # Top-k by score.
        top_idx = np.argsort(scores)[::-1][:k]
        hits: list[Hit] = []
        for i in top_idx:
            if scores[i] <= 0:
                continue
            hits.append(Hit(
                chunk_id=self.chunk_ids[i],
                score=float(scores[i]),
                text=self.texts[i],
                metadata=self.metadata[i],
            ))
        return hits

Two things worth noting:

The index lives in memory. For a 100K-chunk corpus that’s ~50 MB of token lists; not a problem. For 10M+ chunks, swap in a real lexical index. The search interface is unchanged.

Tokenization is intentionally cheap. Lowercase, split on word boundaries. No stemming, no stop-word removal. Modern thinking: BM25’s job is to be the literal-match arm of the pipeline; let dense handle paraphrase. If you stem aggressively, you blur both arms toward the same answer.
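
To see what that buys the literal-match arm, here's the same regex on the kind of query BM25 exists for — underscores count as word characters, so identifiers like DATABASE_URL survive as single tokens:

import re

def tokenize(text: str) -> list[str]:
    # Same pattern as BM25Index._tokenize above.
    return re.findall(r"\b\w+\b", text.lower())

tokenize("Set DATABASE_URL before running migrations.")
# → ['set', 'database_url', 'before', 'running', 'migrations']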

Reciprocal Rank Fusion

# stack/retrieve.py (continuing)
def rrf_fuse(
    rankings: list[list[Hit]],
    k: int = 60,
    top_k: int = 20,
) -> list[Hit]:
    """Combine multiple rankings via Reciprocal Rank Fusion.

    For each candidate c in any input ranking:
        rrf_score(c) = sum over rankings of 1 / (k + rank_in_that_ranking)

    Returns top_k by fused score. Doesn't need scores from inputs to be
    comparable — only their rank order matters.
    """
    fused: dict[int, float] = defaultdict(float)
    text_lookup: dict[int, Hit] = {}

    for ranking in rankings:
        for rank, hit in enumerate(ranking, start=1):
            fused[hit.chunk_id] += 1.0 / (k + rank)
            text_lookup[hit.chunk_id] = hit

    # Sort by fused score desc.
    sorted_ids = sorted(fused.keys(), key=lambda cid: -fused[cid])
    out: list[Hit] = []
    for cid in sorted_ids[:top_k]:
        original = text_lookup[cid]
        out.append(Hit(
            chunk_id=cid,
            score=fused[cid],         # the RRF score now
            text=original.text,
            metadata=original.metadata,
        ))
    return out

The function takes any number of input rankings and produces one fused output. We use it for two rankings (dense + BM25), but it generalizes — you could fuse three (say, by adding a learned sparse retriever or a custom-trained one) by appending another list to the input.
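
For example, assuming you had a third hit list from some other retriever (sparse_hits below is hypothetical), fusing it is just one more element in the input list:

# sparse_hits is a hypothetical third ranking; rrf_fuse doesn't care where it came from.
fused = rrf_fuse([dense_hits, bm25_hits, sparse_hits], top_k=20)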

Cross-encoder rerank

# stack/retrieve.py (continuing)
RERANKER_MODEL = "cross-encoder/ms-marco-MiniLM-L-6-v2"


class Reranker:
    """Cross-encoder that scores (query, chunk) pairs jointly.

    Much higher quality than bi-encoder cosine, ~100× slower per pair,
    so we only use it on the top-K candidates from hybrid retrieval.
    """

    def __init__(self, model_name: str = RERANKER_MODEL):
        self.model_name = model_name
        self._model: CrossEncoder | None = None

    @property
    def model(self) -> CrossEncoder:
        if self._model is None:
            self._model = CrossEncoder(self.model_name, max_length=512)
        return self._model

    def rerank(
        self,
        query: str,
        hits: list[Hit],
        top_k: int = 5,
    ) -> list[Hit]:
        """Rescore each hit against the query, return the top_k."""
        if not hits:
            return []
        pairs = [(query, hit.text) for hit in hits]
        scores = self.model.predict(pairs)  # (n,) numpy array
        # Sort by reranker score desc.
        scored = sorted(zip(scores, hits), key=lambda x: -x[0])
        out: list[Hit] = []
        for new_score, hit in scored[:top_k]:
            out.append(Hit(
                chunk_id=hit.chunk_id,
                score=float(new_score),
                text=hit.text,
                metadata=hit.metadata,
            ))
        return out

The cross-encoder is much more accurate than the bi-encoder we used for retrieval, because it gets to look at the query and the candidate together and apply attention across both. The trade-off: we’d never want to run it across all 100K chunks (that’s 100K forward passes per query). So we use cheap retrieval to find the top-20 candidates, then rerank only those.
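
To make the contrast concrete, here's a minimal standalone sketch of both scoring paths on one (query, chunk) pair. The bi-encoder model name is an assumption standing in for whatever step 07's Embedder loads, and the chunk text is made up — nothing here should be read as the curriculum's actual output.

from sentence_transformers import CrossEncoder, SentenceTransformer

query = "How do I install PostgreSQL?"
chunk = "## Installing the Server\n\nInstall the database server with your OS package manager."

# Bi-encoder (retrieval): embed each side separately, compare with cosine.
# Cheap at query time because chunk embeddings are precomputed.
bi = SentenceTransformer("all-MiniLM-L6-v2")   # assumed stand-in for step 07's model
q_vec, c_vec = bi.encode([query, chunk], normalize_embeddings=True)
cosine = float(q_vec @ c_vec)

# Cross-encoder (rerank): one forward pass over the concatenated pair.
# Accurate, but must run per (query, chunk) pair at query time.
ce = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
joint = float(ce.predict([(query, chunk)])[0])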

The Reranker Lab demo shows real-model behavior: it uses this exact cross-encoder/ms-marco-MiniLM-L-6-v2 model on a small corpus, and you can watch the rerank shuffle the dense ranking.

The full hybrid pipeline

# stack/retrieve.py (continuing)
@dataclass
class HybridConfig:
    n_candidates_per_method: int = 20   # how many to fetch from each retriever
    n_after_fusion: int = 20            # RRF output (cap before rerank)
    n_after_rerank: int = 5             # final return count


class HybridRetriever:
    """Three-stage retrieval: dense + BM25 → RRF fuse → cross-encoder rerank."""

    def __init__(
        self,
        store: VectorStore,
        embedder: Embedder,
        config: HybridConfig | None = None,
    ):
        self.store = store
        self.embedder = embedder
        self.bm25 = BM25Index(store)
        self.reranker = Reranker()
        self.config = config or HybridConfig()

    def retrieve(self, query: str) -> list[Hit]:
        cfg = self.config

        # Stage 1: dense + BM25, in parallel (single-machine, so just sequential).
        q_emb = self.embedder.embed([query])[0]
        dense_hits = self.store.search(q_emb, k=cfg.n_candidates_per_method)
        bm25_hits = self.bm25.search(query, k=cfg.n_candidates_per_method)

        # Stage 2: fuse via RRF.
        fused = rrf_fuse([dense_hits, bm25_hits], top_k=cfg.n_after_fusion)

        # Stage 3: cross-encoder rerank top-K.
        return self.reranker.rerank(query, fused, top_k=cfg.n_after_rerank)

retrieve() reads as the three stages, one call per stage. Easy to enable or disable individual layers (you might skip BM25 if your queries are always semantic, or skip rerank if you need under 10 ms latency).
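
A sketch of what explicit toggles could look like — the flags below are hypothetical and not part of the HybridConfig defined above:

# Hypothetical config variant with per-layer toggles (not in stack/retrieve.py).
@dataclass
class ToggleableHybridConfig(HybridConfig):
    use_bm25: bool = True      # False → dense-only candidate generation
    use_rerank: bool = True    # False → return the RRF order directly

# A retrieve() variant would then guard each stage:
#   bm25_hits = self.bm25.search(query, k=cfg.n_candidates_per_method) if cfg.use_bm25 else []
#   fused = rrf_fuse([dense_hits, bm25_hits], top_k=cfg.n_after_fusion)
#   return (self.reranker.rerank(query, fused, top_k=cfg.n_after_rerank)
#           if cfg.use_rerank else fused[:cfg.n_after_rerank])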

Sanity check

A __main__ block that compares dense-only, BM25-only, and the full pipeline:

# stack/retrieve.py (bottom)
if __name__ == "__main__":
    import time

    # Assumes you've run step 07's __main__ first to populate the store.
    store = VectorStore("data/chunks.db")
    embedder = Embedder()

    if store.count() == 0:
        print("No chunks indexed. Run `uv run python -m stack.embed` first.")
        raise SystemExit(1)

    print(f"corpus: {store.count()} chunks indexed")

    retriever = HybridRetriever(store, embedder)

    queries = [
        "How do I install PostgreSQL?",
        "DATABASE_URL format",         # exact-keyword query, BM25-friendly
        "running migrations",          # paraphrase query, dense-friendly
    ]

    for query in queries:
        print(f"\n══ {query!r} ══")

        # Each stage individually for comparison.
        q_emb = embedder.embed([query])[0]

        t0 = time.perf_counter()
        dense_hits = store.search(q_emb, k=3)
        t_dense = (time.perf_counter() - t0) * 1000

        t0 = time.perf_counter()
        bm25_hits = retriever.bm25.search(query, k=3)
        t_bm25 = (time.perf_counter() - t0) * 1000

        t0 = time.perf_counter()
        full = retriever.retrieve(query)
        t_full = (time.perf_counter() - t0) * 1000

        for label, results, t in [
            ("dense only ", dense_hits, t_dense),
            ("bm25 only  ", bm25_hits, t_bm25),
            ("full hybrid", full[:3], t_full),
        ]:
            print(f"\n  {label}  [{t:.1f} ms]")
            for hit in results:
                heading = " > ".join(hit.metadata.get("headings", []))
                preview = hit.text[:55].replace("\n", " ⏎ ")
                print(f"    {hit.score:.3f}  {heading}")
                print(f"           {preview}…")

    store.close()
Run it:

uv run python -m stack.retrieve

Expected output (approximate; scores depend on the model):

corpus: 4 chunks indexed

══ 'How do I install PostgreSQL?' ══

  dense only   [4.2 ms]
    0.624  Setting Up the Database > Installing the Server
           ## Installing the Server  ⏎  ⏎ We use PostgreSQL …
    0.418  Setting Up the Database
           # Setting Up the Database  ⏎  ⏎ Before you can ru…

  bm25 only    [0.3 ms]
    1.347  Setting Up the Database > Installing the Server
           ## Installing the Server  ⏎  ⏎ We use PostgreSQL …

  full hybrid  [187.4 ms]
    8.237  Setting Up the Database > Installing the Server
           ## Installing the Server  ⏎  ⏎ We use PostgreSQL …

══ 'DATABASE_URL format' ══

  dense only   [3.1 ms]
    0.512  Setting Up the Database > Connection Strings
           ## Connection Strings  ⏎  ⏎ The application reads…
    0.341  Setting Up the Database > Installing the Server
           ## Installing the Server  ⏎  ⏎ We use PostgreSQL …

  bm25 only    [0.2 ms]
    2.193  Setting Up the Database > Connection Strings
           ## Connection Strings  ⏎  ⏎ The application reads…

  full hybrid  [142.3 ms]
    7.812  Setting Up the Database > Connection Strings
           ## Connection Strings  ⏎  ⏎ The application reads…

══ 'running migrations' ══

  dense only   [3.4 ms]
    0.706  Setting Up the Database > Creating the Schema
           ## Creating the Schema  ⏎  ⏎ Run the migrations …
    0.395  Setting Up the Database > Connection Strings
           ## Connection Strings  ⏎  ⏎ The application reads…

  bm25 only    [0.2 ms]
    1.896  Setting Up the Database > Creating the Schema
           ## Creating the Schema  ⏎  ⏎ Run the migrations …

  full hybrid  [134.7 ms]
    8.456  Setting Up the Database > Creating the Schema
           ## Creating the Schema  ⏎  ⏎ Run the migrations …

Two takeaways at this corpus size:

  1. All three methods agree on top-1 here because the corpus is tiny and the queries are unambiguous. You’d see real rank shuffles emerge at 1000+ chunks where ambiguity is common.
  2. Latency budget: dense ~3 ms, BM25 ~0.3 ms, full hybrid ~150 ms (almost all of it the reranker on CPU). On GPU the reranker drops to ~20 ms.

When to skip stages

Production teams sometimes drop one of the stages. The trade-offs:

  • Skip BM25 if your queries are virtually never literal-keyword (chat-style, paraphrasing). Saves ~1 ms latency, simpler code, ~5% recall hit on edge cases.
  • Skip rerank if latency matters more than recall (real-time autocomplete, sub-50 ms requirements). Saves 100–300 ms on CPU. Recall stays decent if dense+BM25 are well-tuned.
  • Use rerank only conditionally (rerank-when-uncertain): if dense top-1 cosine > 0.8, skip rerank. Saves latency on the easy queries (see the sketch after this list).
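
A sketch of that third option as a hypothetical extra method on HybridRetriever — the 0.8 threshold is illustrative, not tuned:

# Hypothetical HybridRetriever method: skip the reranker when dense is already confident.
def retrieve_adaptive(self, query: str) -> list[Hit]:
    cfg = self.config
    q_emb = self.embedder.embed([query])[0]
    dense_hits = self.store.search(q_emb, k=cfg.n_candidates_per_method)
    if dense_hits and dense_hits[0].score > 0.8:   # confident top-1 → cheap path
        return dense_hits[:cfg.n_after_rerank]
    bm25_hits = self.bm25.search(query, k=cfg.n_candidates_per_method)
    fused = rrf_fuse([dense_hits, bm25_hits], top_k=cfg.n_after_fusion)
    return self.reranker.rerank(query, fused, top_k=cfg.n_after_rerank)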

The default for a typical RAG application is “all three, always.” Optimize only when profiling shows a specific bottleneck.


What we did and didn’t do

What we did:

  • BM25 lexical retrieval over the same chunks the vector store has
  • Reciprocal Rank Fusion (the rank-based combiner that beats score-blending)
  • Cross-encoder rerank as the final quality lever
  • A HybridRetriever class that ties all three behind a single retrieve(query) call
  • Side-by-side comparison so the trade-offs are visible

What we didn’t:

  • Tune BM25 hyperparameters (k1 and b). The usual defaults (k1 around 1.2–1.5, b = 0.75, which rank-bm25 ships with) work fine for general English; tune only if you have a quality target you're not hitting.
  • Index BM25 in a real engine (Tantivy, Postgres FTS, Elasticsearch). For corpora under 1M chunks, in-memory rank-bm25 is plenty.
  • Train a custom reranker. Cross-encoders fine-tuned on your domain crush the off-the-shelf one. Worth doing if ranking quality is your bottleneck and you have labeled (query, relevant_chunk) pairs.
  • Add metadata filtering (WHERE source = 'docs/api/'). Step 09 introduces it via the tool layer; for retrieval-only use, sqlite-vec’s MATCH syntax composes with regular WHERE clauses.

Next

Step 09 is tools and function calling — letting the model invoke functions in your stack as part of generation. Schema design, OSS-model adapters, the dance between the LLM saying “I’d like to call search_docs(query='...')” and your code actually executing it. After step 09 you’ll have the building block agents need.