What is Reranking?
Reranking is a two-stage retrieval approach: first, retrieve a broad set of candidate documents using fast, approximate methods (vector similarity, BM25), then use a more accurate but slower model to re-score and reorder them. You get the speed of approximate search over the full corpus, plus the accuracy of an expensive model applied to only a handful of candidates.
The key insight is that bi-encoder models (used for embeddings) process the query and document independently, then compare the resulting vectors. Cross-encoder models (used for reranking) process the query and document together, allowing much richer interaction between them. This cross-attention produces more accurate relevance scores.
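The two-stage flow can be sketched in a few lines. `fast_search` and `rerank_score` below are hypothetical callables standing in for a vector store or BM25 index and a cross-encoder, respectively:

```python
# Minimal two-stage retrieval skeleton (a sketch, not a full implementation).
# `fast_search` and `rerank_score` are assumed stand-ins for real components.

def two_stage_search(query, fast_search, rerank_score, fetch_k=50, top_k=5):
    # Stage 1: cheap, approximate retrieval over the whole corpus
    candidates = fast_search(query, k=fetch_k)
    # Stage 2: expensive, accurate scoring over just the candidates
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The cost asymmetry is the point: stage 1 touches millions of documents cheaply, stage 2 runs the expensive model only `fetch_k` times.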
Bi-Encoder vs Cross-Encoder
| Property | Bi-Encoder (Retrieval) | Cross-Encoder (Reranking) |
|---|---|---|
| Input | Query and doc separately | Query + doc together |
| Output | Independent vectors | Single relevance score |
| Speed | Very fast (precomputed) | Slow (runs per pair) |
| Accuracy | Good | Excellent |
| Use Case | Initial retrieval (top 100) | Reranking (top 100 to top 5) |
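The table above can be made concrete with a toy sketch. `embed()` and `cross_score()` are deliberately trivial stand-ins (character frequencies and word overlap), not real models; the point is the shape of the computation — bi-encoders let you precompute document vectors once offline, while a cross-encoder must run per (query, document) pair at query time:

```python
# Toy illustration of the bi-encoder vs cross-encoder split.
# embed() and cross_score() are hypothetical stand-ins for neural encoders.

def embed(text: str) -> list[float]:
    """Stand-in bi-encoder: normalized character counts (real systems use a model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cross_score(query: str, doc: str) -> float:
    """Stand-in cross-encoder: scores query and doc *together* (word overlap here)."""
    q_words, d_words = set(query.lower().split()), set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

docs = ["reset your password in settings", "pricing starts at $9.99"]

# Bi-encoder: document vectors precomputed once; queries compared by dot product
doc_vecs = [embed(d) for d in docs]          # offline, once per corpus
q_vec = embed("how do i reset my password")  # at query time
bi_scores = [sum(a * b for a, b in zip(q_vec, v)) for v in doc_vecs]

# Cross-encoder: the scoring function runs once per (query, document) pair
ce_scores = [cross_score("how do i reset my password", d) for d in docs]
```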
Cross-Encoder Reranking
```python
# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder

# Load a cross-encoder trained on the MS MARCO passage-ranking dataset
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_documents(
    query: str,
    documents: list[str],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Rerank documents using a cross-encoder."""
    # Create query-document pairs
    pairs = [(query, doc) for doc in documents]
    # Score all pairs in one batch
    scores = reranker.predict(pairs)
    # Sort by score (descending)
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return scored_docs[:top_k]

# Example: initial retrieval returned these candidates; reranking picks the top 3
candidates = [
    "To reset your password, go to Settings > Security",
    "Password requirements: 8 chars, 1 uppercase, 1 number",
    "Account recovery options include email and phone",
    "Two-factor authentication adds an extra security layer",
    "Contact support for locked account assistance",
    "Our pricing plans start at $9.99/month",
    "API documentation is available at docs.example.com",
]

reranked = rerank_documents(
    query="How do I change my password?",
    documents=candidates,
    top_k=3,
)
for doc, score in reranked:
    print(f"[{score:.4f}] {doc}")
```
Cohere Rerank API
Cohere offers a managed reranking API that's simple to integrate and performs well across domains.
```typescript
// Cohere Rerank API
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({
  token: process.env.COHERE_API_KEY!,
});

async function cohereRerank(
  query: string,
  documents: string[],
  topK: number = 5
): Promise<{ text: string; score: number; index: number }[]> {
  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents,
    topN: topK,
    returnDocuments: true,
  });
  return response.results.map((r) => ({
    text: documents[r.index],
    score: r.relevanceScore,
    index: r.index,
  }));
}

// Integration with a RAG pipeline.
// Assumes `vectorStore` and `anthropic` clients are initialized elsewhere.
async function ragWithReranking(question: string): Promise<string> {
  // Step 1: Retrieve a broad set of candidates
  const candidates = await vectorStore.similaritySearch(question, 20);

  // Step 2: Rerank with Cohere
  const reranked = await cohereRerank(
    question,
    candidates.map((c) => c.pageContent),
    5
  );

  // Step 3: Use the top reranked results as context
  const context = reranked.map((r) => r.text).join("\n\n");

  // Step 4: Generate the answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: `Answer based on context:\n${context}`,
    messages: [{ role: "user", content: question }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}
```
```python
# Cohere Rerank in Python
import cohere

co = cohere.Client()

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )
    return [
        {
            "text": r.document.text,
            "score": r.relevance_score,
            "original_index": r.index,
        }
        for r in results.results
    ]

# Usage
docs = ["doc1...", "doc2...", "doc3..."]
reranked = rerank_with_cohere("my query", docs)
for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")
```
LLM-Based Reranking
You can also use an LLM itself to rerank results, especially when you need domain-specific relevance judgments.
```typescript
// LLM-based reranking with Claude.
// Assumes `client` is an initialized Anthropic SDK client.
async function llmRerank(
  query: string,
  documents: { id: string; content: string }[],
  topK: number = 5
): Promise<string[]> {
  const docList = documents
    .map((d, i) => `[Doc ${i + 1}] ${d.content.slice(0, 200)}`)
    .join("\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Given the query: "${query}"

Rank these documents by relevance. Return ONLY the document numbers in order of relevance, most relevant first.
Format: 3, 1, 5, 2, 4

Documents:
${docList}`,
      },
    ],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  const indices = text.match(/\d+/g)?.map(Number) || [];
  return indices
    .filter((i) => i >= 1 && i <= documents.length)
    .slice(0, topK)
    .map((i) => documents[i - 1].content);
}
```
Reranking Best Practices
- Over-retrieve, then rerank: Fetch 20-50 candidates and rerank down to the top 3-5. More candidates give the reranker more chances to surface the right document, at only linear scoring cost.
- Cross-encoder models are small: Local cross-encoders like MiniLM can score dozens of pairs in tens of milliseconds on modest hardware, so the added latency is rarely a blocker.
- Cohere Rerank is production-ready: If you need a managed API, Cohere provides excellent quality with simple integration.
- Combine with hybrid search: Hybrid retrieval + reranking is the gold standard for RAG quality in 2026.
- Cache reranking results: If the same query is asked often, cache the reranked order to avoid repeated computation.
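The caching advice above can be sketched as a thin memoization wrapper. `RerankCache` is a hypothetical in-memory helper, not a library API; a production system might back it with Redis and a TTL instead:

```python
# Sketch of a reranking cache keyed by (query, candidate list, options).
# Works with any reranker callable: cross-encoder, Cohere, or LLM-based.
import hashlib
import json

class RerankCache:
    """Memoize reranker calls so repeated queries skip the expensive model."""

    def __init__(self, rerank_fn):
        self.rerank_fn = rerank_fn
        self._cache: dict[str, object] = {}

    def _key(self, query: str, documents: list[str], kwargs: dict) -> str:
        # Hash the full input so different candidate sets or options never collide
        payload = json.dumps(
            {"q": query, "docs": documents, "kw": kwargs},
            sort_keys=True,
            default=str,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def rerank(self, query: str, documents: list[str], **kwargs):
        key = self._key(query, documents, kwargs)
        if key not in self._cache:  # only call the reranker on a cache miss
            self._cache[key] = self.rerank_fn(query, documents, **kwargs)
        return self._cache[key]
```

Wrapping `rerank_documents` from earlier would look like `cache = RerankCache(rerank_documents)`, then calling `cache.rerank(query, docs)` everywhere the raw function was used.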
Summary
Reranking is one of the highest-impact improvements you can make to a RAG pipeline. By adding a cross-encoder or API-based reranker between retrieval and generation, you dramatically improve the precision of your context window. The extra 50-200ms of latency is almost always worth the quality improvement.