Pattern
Cross-encoder reranker
- Taxonomy
- patterns.ranking
- Category
- ranking
- Complexity
- medium
- When to use
- Top-K ANN recall is high but ordering is noisy; you can afford a few hundred ms of extra latency for sharper top-3.
- When NOT to use
- Single-result lookups, or strict latency budgets under 50 ms total.
What it is
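A bi-encoder embeds the query and each document independently. That is fast, because document vectors can be precomputed and indexed, but lossy, because query and document never interact before scoring. A cross-encoder takes the (query, document) pair as a single input and scores it jointly. That is slower, because it needs one forward pass per pair and nothing can be precomputed, but it orders results far more accurately. The standard recipe: bi-encoder for retrieval (top-50 to top-200), cross-encoder for reranking (down to top-3 to top-10).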
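The two-stage recipe above can be sketched as follows. This is a minimal illustration, not a reference implementation: `retrieve_then_rerank` and `toy_overlap_scorer` are hypothetical names, the embeddings are hand-written 2-D vectors, and the word-overlap scorer only stands in for a real cross-encoder so the example runs without model weights.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two precomputed embeddings."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_then_rerank(
    query: str,
    query_vec: Vector,
    docs: List[str],
    doc_vecs: List[Vector],
    pair_scorer: Callable[[str, str], float],
    retrieve_k: int = 100,
    final_k: int = 5,
) -> List[Tuple[str, float]]:
    # Stage 1 (bi-encoder): rank by similarity of independent embeddings.
    # A real system would query an ANN index instead of sorting everything.
    order = sorted(
        range(len(docs)),
        key=lambda i: dot(query_vec, doc_vecs[i]),
        reverse=True,
    )
    candidates = order[:retrieve_k]

    # Stage 2 (cross-encoder): score each (query, document) pair jointly.
    scored = [(docs[i], pair_scorer(query, docs[i])) for i in candidates]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:final_k]


def toy_overlap_scorer(query: str, doc: str) -> float:
    """Stand-in for a cross-encoder: fraction of query words found in doc."""
    q_words = set(query.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words)


docs = [
    "reset your password via email",
    "password strength guidelines",
    "email notification settings",
    "office lunch menu",
]
# Hand-written toy embeddings; stage 1 alone ranks docs[1] above docs[0].
doc_vecs = [[0.8, 0.5], [0.9, 0.1], [0.3, 0.9], [0.0, 0.0]]

top = retrieve_then_rerank(
    "how do I reset my password",
    [1.0, 0.2],
    docs,
    doc_vecs,
    toy_overlap_scorer,
    retrieve_k=3,
    final_k=2,
)
```

In this toy data the first stage ranks "password strength guidelines" first, and the pair scorer moves "reset your password via email" to the top. In practice `pair_scorer` would wrap a real reranker, batched over all candidates rather than called one pair at a time.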
Common cross-encoders
- BGE reranker series (open weights, multilingual).
- Cohere Rerank (managed API, strong baseline).
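- ColBERT-style late-interaction models. These are not cross-encoders in the strict sense: they keep per-token embeddings and score them at query time, giving higher recall than a single-vector bi-encoder at the cost of more memory.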
Trade-offs
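- Latency is linear in K: every candidate needs its own forward pass, and batching reduces wall-clock time but not total compute. Reranking 100 candidates with a 6B cross-encoder is not free.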
- Quality lift is largest when bi-encoder recall is high but precision@1 is low, which is the typical RAG failure mode.
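Both trade-offs can be checked with simple arithmetic before adopting the pattern. The sketch below assumes a deliberately simple cost model (one fixed-cost forward pass per batch of pairs) and a hit-rate definition of recall@K (at least one relevant document in the top K). All function names and the example numbers (16 pairs per batch, 40 ms per batch) are hypothetical, not measurements.

```python
import math
from typing import Sequence, Set


def rerank_latency_ms(k: int, batch_size: int, ms_per_batch: float) -> float:
    """Estimated rerank latency: one forward pass per batch of pairs."""
    return math.ceil(k / batch_size) * ms_per_batch


def max_candidates(budget_ms: float, batch_size: int, ms_per_batch: float) -> int:
    """Largest K that fits in the latency budget under the same cost model."""
    return int(budget_ms // ms_per_batch) * batch_size


def recall_at_k(
    ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]], k: int
) -> float:
    """Fraction of queries with at least one relevant doc in the top k."""
    hits = sum(1 for docs, rel in zip(ranked, relevant) if rel & set(docs[:k]))
    return hits / len(ranked)


def precision_at_1(
    ranked: Sequence[Sequence[str]], relevant: Sequence[Set[str]]
) -> float:
    """Fraction of queries whose first result is relevant."""
    hits = sum(1 for docs, rel in zip(ranked, relevant) if docs and docs[0] in rel)
    return hits / len(ranked)


# Hypothetical numbers: 100 candidates, 16 pairs per batch, 40 ms per batch.
estimated_ms = rerank_latency_ms(100, batch_size=16, ms_per_batch=40.0)
affordable_k = max_candidates(300.0, batch_size=16, ms_per_batch=40.0)

# Toy first-stage results for three queries, with labelled relevant doc IDs.
ranked = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
relevant = [{"b"}, {"d"}, {"z"}]
stage1_recall = recall_at_k(ranked, relevant, k=3)
stage1_p1 = precision_at_1(ranked, relevant)
```

A large gap between `stage1_recall` and `stage1_p1` on a labelled sample means the right documents are being retrieved but ordered poorly, which is the case where a reranker has the most to gain. If recall itself is low, reranking cannot recover documents that were never retrieved.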