
Cross-encoders fix what bi-encoders break

Initial retrieval gives you the top-100 candidates. A cross-encoder reranker reads each (query, chunk) pair together and re-orders them. It's often the single biggest quality lever in production RAG.

Bi-encoder vs cross-encoder — what's actually different

# bi-encoder (initial retrieval, FAST):
emb_q  = encoder(query)         # encoded ONCE per query
emb_d  = encoder(doc)           # encoded ONCE per doc, cached at index time
score  = cos_sim(emb_q, emb_d)  # cheap dot product per candidate

# cross-encoder (rerank, SLOW but ACCURATE):
score  = encoder(query, doc)    # query and doc go in together;
                                # the encoder cross-attends across them
                                # → much richer score, but a fresh
                                # forward pass per pair
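To make the cost split concrete, here is a toy sketch of the two scoring patterns. The "encoder" is just bag-of-words counts and the "cross" score is query-term coverage — stand-ins for real models, purely to show where each computation happens, not how a trained encoder scores.

```python
import math
from collections import Counter

def embed(text):
    # toy bi-encoder "embedding": bag-of-words counts (stand-in for a real model)
    return Counter(text.lower().split())

def cos_sim(a, b):
    # cheap similarity between two precomputed vectors
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_score(query, doc):
    # toy cross-encoder: sees BOTH texts at once, so it can compute a
    # query-conditioned signal (here: fraction of query terms the doc covers)
    q_terms = set(query.lower().split())
    return len(q_terms & set(doc.lower().split())) / len(q_terms)

docs = ["reranking improves retrieval quality",
        "the weather is nice today"]
query = "does reranking improve quality"

emb_q = embed(query)                           # encoded ONCE per query
bi    = [cos_sim(emb_q, embed(d)) for d in docs]   # doc embeddings cacheable
cross = [cross_score(query, d) for d in docs]      # fresh "pass" per pair
```

Note where the cache boundary sits: `embed(d)` can be precomputed at index time, while `cross_score` cannot, because it needs the query.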

The cross-encoder catches things the bi-encoder misses because it can attend across the query and the chunk together. Common example: "Did Apple sue Samsung?" — the bi-encoder might rank a paragraph that merely mentions apples and Samsung highly, because the terms match; the cross-encoder catches that the chunk doesn't actually answer the legal question.
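The Apple/Samsung failure mode can be mimicked in a few lines. Assumptions labeled loudly: the "bi-encoder" here is bare term overlap, and the cross-style check uses a hand-picked cue-word set standing in for what cross-attention learns — a real cross-encoder learns this from data, it has no such word list.

```python
def term_overlap(query, doc):
    # stand-in for bi-encoder cosine: shared surface terms only
    return len(set(query.split()) & set(doc.split()))

# hand-picked legal cue words — a TOY proxy for learned cross-attention
LEGAL_CUES = {"sue", "sued", "lawsuit", "litigation"}

def cross_check(query, doc):
    # sees query and doc together: boost docs that address the legal verb
    score = term_overlap(query, doc)
    if "sue" in query.split() and LEGAL_CUES & set(doc.split()):
        score += 1
    return score

query = "did apple sue samsung"
fruit = "fresh apple slices reviewed on a samsung tablet"
legal = "apple filed a lawsuit against samsung in federal court"
```

Both chunks share exactly the terms "apple" and "samsung" with the query, so overlap alone ties them; only the query-conditioned check separates the legal chunk from the fruit one.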

Try this — predict before you click

  1. Pick the first scenario. Look at the bi-encoder ranking vs the cross-encoder ranking. Predict: the rankings differ on at least 1–2 chunks. The "lift" stat shows the gain.
  2. Pick a scenario where dense retrieval already had the right answer at #1. Predict: cross-encoder rerank doesn't change the top result — but it does shuffle middle ranks (5–10). Reranking is most useful when the bi-encoder is unsure between several plausible candidates.
  3. Look at scenarios where the dense top-1 is wrong. Predict: these are the cases where cross-encoder rerank pays off most. Production RAG stacks pull the top-50 from dense and rerank to top-5 — the rerank's quality lift on those middle ranks is the difference between "answer is in context" and "answer is missing".
  4. The trade-off: cross-encoder is ~100× slower per pair than dense cosine. Predict: production systems use dense to get top-50 cheaply, then rerank only those 50 with the cross-encoder. Total latency = single dense pass + 50 cross-encoder passes ≪ 50K cross-encoder passes over the full index.
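The arithmetic behind step 4 can be checked on the back of an envelope. The index size (50K) and the ~100× per-pair cost come from the text above; the cost unit itself is arbitrary and purely illustrative.

```python
N_INDEX    = 50_000   # chunks in the index (from the text)
K_RERANK   = 50       # candidates handed to the cross-encoder
COS_COST   = 1.0      # arbitrary unit: one cosine against a cached embedding
CROSS_COST = 100.0    # ~100x a cosine per (query, chunk) pair (from the text)

# production pipeline: one dense sweep over the index, then rerank the top-50
rerank_pipeline  = N_INDEX * COS_COST + K_RERANK * CROSS_COST

# naive alternative: cross-encode every chunk in the index
cross_everything = N_INDEX * CROSS_COST

# 55,000 units vs 5,000,000 units — the pipeline is ~90x cheaper
```

The ratio grows linearly with index size, which is why "dense first, rerank a short list" is the standard shape regardless of which cross-encoder you pick.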

Anchored to 09-rag/hybrid-search-and-reranking. Code-side: /ship/08 — retrieval (BM25 + dense + rerank).