# Reranking

Reranking is a second-pass scoring step that improves retrieval precision. While initial retrieval (vector search) optimizes for recall, reranking optimizes for relevance to the specific query.
## Why Rerank?
Consider this example:
| Query | "How to train a neural network" |
|---|---|
| Vector search returns | Documents about training, neural networks, networks in general |
| After reranking | Documents specifically about training neural networks |
Initial retrieval casts a wide net. Reranking picks the best fish.
## Available Strategies
### Vertex AI Ranking
Google’s neural cross-encoder reranking service.
```toml
[rag]
reranking_enabled = true
reranker = "vertex"
```

How it works:
- Takes query + document as input
- Jointly encodes them (cross-attention)
- Outputs a relevance score
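At the time of writing, this service is exposed through the `google-cloud-discoveryengine` client library. A rough sketch of a direct call is below; the project ID, ranking-config name, and model version are placeholders, so adapt them to your deployment:

```python
from google.cloud import discoveryengine_v1 as discoveryengine

client = discoveryengine.RankServiceClient()

response = client.rank(
    request=discoveryengine.RankRequest(
        # "my-project" and "default_ranking_config" are placeholders
        ranking_config=client.ranking_config_path(
            project="my-project",
            location="global",
            ranking_config="default_ranking_config",
        ),
        model="semantic-ranker-512@latest",  # model name may vary by region/version
        query="How to train a neural network",
        records=[
            discoveryengine.RankingRecord(
                id="1", content="This guide covers training loops and backpropagation..."
            ),
            discoveryengine.RankingRecord(
                id="2", content="Computer networks route packets between hosts..."
            ),
        ],
        top_n=10,
    )
)

# Each record comes back with a cross-encoder relevance score
for record in response.records:
    print(record.id, record.score)
```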
Advantages:
- State-of-the-art accuracy
- Handles long documents well
- No local compute needed
Requirements:
- Google Cloud project
- Vertex AI enabled
### MMR (Maximal Marginal Relevance)
Balances relevance with diversity.
```toml
[rag]
reranking_enabled = true
reranker = "mmr"
mmr_lambda = 0.5  # 0 = max diversity, 1 = max relevance
```

How it works:
```
MMR = λ × Relevance(doc, query) − (1 − λ) × max(Similarity(doc, selected_docs))
```

It iteratively selects documents that are:
- Relevant to the query
- Different from already-selected documents
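A minimal sketch of that greedy selection loop, assuming the query and documents are represented as embedding vectors (NumPy arrays):

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mmr_select(query_vec, doc_vecs, k, lam=0.5):
    """Greedily pick k documents balancing relevance and novelty."""
    candidates = list(range(len(doc_vecs)))
    selected = []
    while candidates and len(selected) < k:
        def score(i):
            relevance = cosine(query_vec, doc_vecs[i])
            # Penalty: similarity to the closest already-selected document
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected  # document indices in selection order
```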
When to use:
- Results are too similar
- Need diverse perspectives
- Exploratory queries
### None (Disabled)
Use initial retrieval scores only.
```toml
[rag]
reranking_enabled = false
```

When to use:
- Latency-critical applications
- Already high-quality initial retrieval
- Testing/development
## Configuration
### Basic Setup
```toml
[rag]
reranking_enabled = true
reranker = "vertex"  # "vertex", "mmr", or "none"
```

### Reranking Limits
Control how many documents go through reranking:
```toml
[rag]
# Initial retrieval fetches more documents
initial_retrieval_count = 50

# Rerank top N
rerank_top_n = 50

# Return top K after reranking
general_retrieval_final_count = 10
```
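Together these settings form a funnel: fetch wide, rerank a subset, return a few. A hypothetical sketch of that flow (the `search_fn` and `rerank_fn` callables are illustrative, not the pipeline's real API):

```python
def retrieve_with_rerank(query, search_fn, rerank_fn,
                         initial_retrieval_count=50,
                         rerank_top_n=50,
                         general_retrieval_final_count=10):
    """Funnel: wide initial retrieval -> rerank a subset -> return the top K."""
    candidates = search_fn(query, limit=initial_retrieval_count)
    reranked = rerank_fn(query, candidates[:rerank_top_n])
    return reranked[:general_retrieval_final_count]
```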
### Per-Source Limits

After reranking, apply per-source limits:
```toml
[rag]
max_books = 5   # Max book citations
max_videos = 3  # Max video citations
max_qa = 5      # Max Q&A citations
max_web = 3     # Max web citations
```
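Enforcing such caps is a single pass over the reranked list. A sketch, assuming each document carries a `source` field corresponding to the config keys:

```python
from collections import defaultdict

def apply_source_limits(ranked_docs, limits):
    """Keep docs in rank order, dropping any beyond their source's cap."""
    counts = defaultdict(int)
    kept = []
    for doc in ranked_docs:
        cap = limits.get(doc["source"])
        if cap is None or counts[doc["source"]] < cap:
            counts[doc["source"]] += 1
            kept.append(doc)
    return kept

# e.g. apply_source_limits(docs, {"books": 5, "videos": 3, "qa": 5, "web": 3})
```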
## How Vertex AI Ranking Works

Vertex AI Ranking uses a cross-encoder architecture:
```
┌─────────────────────────────────────────┐
│              Cross-Encoder              │
│                                         │
│  Query: "train neural network"          │
│  Document: "This guide covers..."       │
│                                         │
│           ┌─────────────┐               │
│           │ Transformer │               │
│           │   Layers    │               │
│           └──────┬──────┘               │
│                  │                      │
│                  ▼                      │
│         Relevance Score: 0.94           │
└─────────────────────────────────────────┘
```

Unlike bi-encoders (used in vector search), which encode the query and document separately, cross-encoders process them together, enabling richer interaction modeling.
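The contrast is easy to demonstrate with the `sentence-transformers` library, which ships both architectures. The checkpoints below are common public models, not necessarily what Vertex AI Ranking uses:

```python
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "How to train a neural network"
doc = "This guide covers training loops, loss functions, and backpropagation."

# Bi-encoder: query and document encoded independently, compared afterwards
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_vec, d_vec = bi_encoder.encode([query, doc])
bi_score = util.cos_sim(q_vec, d_vec)

# Cross-encoder: query and document attended to jointly in one forward pass
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
cross_score = cross_encoder.predict([(query, doc)])
```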
## Reranking Without Google Cloud
For deployments without Vertex AI access:
### Option 1: Hybrid Fusion Only
Rely on well-tuned hybrid search:
```toml
[rag]
reranking_enabled = false
hybrid_fusion = "rrf"
enable_fts = true
```
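Reciprocal rank fusion (RRF) itself is simple: every ranked list contributes `1 / (k + rank)` for each document it contains, with `k ≈ 60` as the conventional smoothing constant. A sketch:

```python
from collections import defaultdict

def rrf_fuse(result_lists, k=60):
    """Fuse ranked lists of doc IDs by reciprocal rank; higher score = better."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# e.g. rrf_fuse([vector_hits, fulltext_hits]) merges dense and FTS rankings
```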
### Option 2: MMR

Use MMR for diversity-aware selection:
```toml
[rag]
reranking_enabled = true
reranker = "mmr"
mmr_lambda = 0.7
```

### Option 3: Local Cross-Encoder (Future)
Coming soon: local reranking with models like FlashRank or sentence-transformers cross-encoders.
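Until then, local reranking can be prototyped outside the pipeline. A sketch using a public sentence-transformers checkpoint:

```python
from sentence_transformers import CrossEncoder

def local_rerank(query, docs, top_k=10):
    """Score (query, doc) pairs with a local cross-encoder and keep the best."""
    model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
    scores = model.predict([(query, doc) for doc in docs])
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```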
## Best Practices
### Always Rerank for User-Facing Queries
The latency cost (100-200ms) is worth the quality improvement for interactive use.
### Skip for Batch Processing
When processing many queries in batch:
```python
runtime_settings = {"reranking_enabled": False}
```
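For example, a batch job might disable reranking on every call, reusing the `retrieve` API shown under Monitor Reranking Impact below:

```python
async def process_batch(pipeline, queries):
    # Skip reranking to keep per-query latency low in bulk jobs
    return [
        await pipeline.retrieve(q, reranking_enabled=False)
        for q in queries
    ]
```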
### Tune MMR Lambda

| Lambda | Behavior |
|---|---|
| 0.0 | Maximum diversity (different topics) |
| 0.5 | Balanced (default) |
| 1.0 | Maximum relevance (may be repetitive) |
Start with 0.5 and adjust based on user feedback.
### Monitor Reranking Impact
Compare results with and without reranking:
```python
# With reranking
result_reranked = await pipeline.retrieve(query, reranking_enabled=True)

# Without reranking
result_raw = await pipeline.retrieve(query, reranking_enabled=False)

# Compare overlap, order, and user satisfaction
```
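One simple way to quantify the difference is the overlap between the two result sets. A sketch, where the `.documents` accessor is hypothetical (adapt it to however results expose their documents):

```python
def jaccard_overlap(ids_a, ids_b):
    """Fraction of documents shared between two result sets."""
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# Hypothetical accessors: adapt to the actual result object
overlap = jaccard_overlap(
    [doc.id for doc in result_reranked.documents],
    [doc.id for doc in result_raw.documents],
)
print(f"Result overlap: {overlap:.0%}")
```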