TechLead
Lesson 16 of 24
AI Agents & RAG

Reranking Techniques

Improve retrieval precision with cross-encoder reranking, Cohere Rerank, and ColBERT

What is Reranking?

Reranking is a two-stage retrieval approach: first, retrieve a broad set of candidate documents using fast approximate methods (vector similarity, BM25), then use a more accurate but slower model to re-score and reorder them. You get the speed of approximate search over the full corpus, and pay for the expensive model only on a small candidate set.

The key insight is that bi-encoder models (used for embeddings) process the query and document independently, then compare the resulting vectors. Cross-encoder models (used for reranking) process the query and document together, allowing much richer interaction between them. This cross-attention produces more accurate relevance scores.
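To make the contrast concrete, here is a toy sketch of the two information flows. These are pure-Python caricatures, not real models; only the shape is faithful: the bi-encoder stand-in encodes each text alone and compares vectors, while the cross-encoder stand-in sees the query and document together as a pair.

```python
import math

# Toy caricatures of the two architectures (real ones are transformers).
# Only the information flow is faithful: one encodes texts separately,
# the other scores the pair jointly.

def embed(text: str) -> dict[str, float]:
    """Bi-encoder stand-in: encode ONE text into a vector (word counts)."""
    vec: dict[str, float] = {}
    for word in text.lower().split():
        vec[word] = vec.get(word, 0.0) + 1.0
    return vec

def bi_encoder_score(query: str, doc: str) -> float:
    """Embed query and doc independently, then compare vectors (cosine)."""
    q, d = embed(query), embed(doc)
    dot = sum(v * d.get(w, 0.0) for w, v in q.items())
    norm = math.sqrt(sum(v * v for v in q.values())) * math.sqrt(sum(v * v for v in d.values()))
    return dot / norm if norm else 0.0

def cross_encoder_score(query: str, doc: str) -> float:
    """Score the pair jointly: this function sees both texts at once,
    so it can condition directly on the pairing (here: query coverage)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0
```

The crucial point is precomputability: `embed` depends on one text only, so document vectors can be indexed offline, while `cross_encoder_score` must run once per query-document pair at query time.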

Bi-Encoder vs Cross-Encoder

Property   | Bi-Encoder (Retrieval)       | Cross-Encoder (Reranking)
-----------|------------------------------|------------------------------
Input      | Query and doc separately     | Query + doc together
Output     | Independent vectors          | Single relevance score
Speed      | Very fast (precomputed)      | Slow (runs per pair)
Accuracy   | Good                         | Excellent
Use Case   | Initial retrieval (top 100)  | Reranking (top 100 to top 5)
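Put together, the two stages look like this in outline. This is a runnable sketch with toy scorers standing in for the real components: `fast_score` plays the role of the vector index or BM25, `careful_score` the role of the cross-encoder.

```python
# Two-stage retrieval in outline. The scorers are toy stand-ins:
# stage 1 mimics fast vector/BM25 retrieval over the whole corpus,
# stage 2 mimics an expensive reranker over a small candidate set.

def fast_score(query: str, doc: str) -> float:
    """Stage 1 stand-in: cheap overlap score (think ANN similarity search)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def careful_score(query: str, doc: str) -> float:
    """Stage 2 stand-in: a slower, 'more accurate' score (think cross-encoder)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0

def retrieve_then_rerank(query: str, corpus: list[str],
                         n_candidates: int = 100, top_k: int = 5) -> list[str]:
    # Stage 1: broad, fast pass over everything
    candidates = sorted(corpus, key=lambda doc: fast_score(query, doc),
                        reverse=True)[:n_candidates]
    # Stage 2: careful re-scoring of only the candidates
    return sorted(candidates, key=lambda doc: careful_score(query, doc),
                  reverse=True)[:top_k]
```

The cost structure is the whole point: the expensive scorer runs `n_candidates` times per query regardless of corpus size.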

Cross-Encoder Reranking

# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder

# Load a cross-encoder model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_documents(
    query: str,
    documents: list[str],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Rerank documents using a cross-encoder."""
    # Create query-document pairs
    pairs = [(query, doc) for doc in documents]

    # Score all pairs
    scores = reranker.predict(pairs)

    # Sort by score (descending)
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)

    return scored_docs[:top_k]

# Example: Initial retrieval returns 20 candidates, reranking picks top 5
candidates = [
    "To reset your password, go to Settings > Security",
    "Password requirements: 8 chars, 1 uppercase, 1 number",
    "Account recovery options include email and phone",
    "Two-factor authentication adds an extra security layer",
    "Contact support for locked account assistance",
    "Our pricing plans start at $9.99/month",
    "API documentation is available at docs.example.com",
]

reranked = rerank_documents(
    query="How do I change my password?",
    documents=candidates,
    top_k=3,
)

for doc, score in reranked:
    print(f"[{score:.4f}] {doc}")

Cohere Rerank API

Cohere offers a managed reranking API that's simple to integrate and performs well across domains.

// Cohere Rerank API
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({
  token: process.env.COHERE_API_KEY!,
});

async function cohereRerank(
  query: string,
  documents: string[],
  topK: number = 5
): Promise<{ text: string; score: number; index: number }[]> {
  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents,
    topN: topK,
    returnDocuments: true,
  });

  return response.results.map(r => ({
    text: documents[r.index],
    score: r.relevanceScore,
    index: r.index,
  }));
}

// Integration with RAG pipeline
async function ragWithReranking(question: string): Promise<string> {
  // Step 1: Retrieve broad set of candidates
  const candidates = await vectorStore.similaritySearch(question, 20);

  // Step 2: Rerank with Cohere
  const reranked = await cohereRerank(
    question,
    candidates.map(c => c.pageContent),
    5
  );

  // Step 3: Use top reranked results as context
  const context = reranked.map(r => r.text).join("\n\n");

  // Step 4: Generate answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: `Answer based on context:\n${context}`,
    messages: [{ role: "user", content: question }],
  });

  return response.content[0].type === "text" ? response.content[0].text : "";
}

# Cohere Rerank in Python
import cohere

co = cohere.Client()

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )

    return [
        {
            "text": r.document.text,
            "score": r.relevance_score,
            "original_index": r.index,
        }
        for r in results.results
    ]

# Usage
docs = ["doc1...", "doc2...", "doc3..."]
reranked = rerank_with_cohere("my query", docs)
for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")

LLM-Based Reranking

You can also use an LLM itself to rerank results, especially when you need domain-specific relevance judgments.

// LLM-based reranking with Claude
async function llmRerank(
  query: string,
  documents: { id: string; content: string }[],
  topK: number = 5
): Promise<string[]> {
  const docList = documents
    .map((d, i) => `[Doc ${i + 1}] ${d.content.slice(0, 200)}`)
    .join("\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Given the query: "${query}"

Rank these documents by relevance. Return ONLY the document numbers in order of relevance, most relevant first.
Format: 3, 1, 5, 2, 4

Documents:
${docList}`,
      },
    ],
  });

  const first = response.content[0];
  const text = first.type === "text" ? first.text : "";
  const indices = text.match(/\d+/g)?.map(Number) || [];

  return indices
    .filter(i => i >= 1 && i <= documents.length)
    .slice(0, topK)
    .map(i => documents[i - 1].content);
}
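One fragile point in this pattern is parsing the model's free-text ranking: the output can contain duplicates, stray numbers, or out-of-range indices. A small helper (an illustrative Python sketch, separate from the snippet above) can validate and dedupe before use:

```python
import re

def parse_ranking(text: str, n_docs: int, top_k: int) -> list[int]:
    """Parse 1-based document numbers from a model's 'most relevant first'
    ranking, dropping duplicates and out-of-range values."""
    seen: set[int] = set()
    order: list[int] = []
    for match in re.findall(r"\d+", text):
        i = int(match)
        if 1 <= i <= n_docs and i not in seen:
            seen.add(i)
            order.append(i)
    return order[:top_k]
```

A fallback is also worth considering: if parsing yields nothing, return the original retrieval order rather than an empty context.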

Reranking Best Practices

  • Over-retrieve, then rerank: Fetch 20-50 candidates, then rerank down to the top 3-5. More candidates generally means better final results, with diminishing returns.
  • Cross-encoder models are small: Compact local cross-encoders can score a reranking batch in tens of milliseconds, so latency is rarely a blocker.
  • Cohere Rerank is production-ready: If you need a managed API, Cohere provides excellent quality with simple integration.
  • Combine with hybrid search: Hybrid retrieval + reranking is the gold standard for RAG quality in 2026.
  • Cache reranking results: If the same query is asked often, cache the reranked order to avoid repeated computation.
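The caching bullet can be as simple as memoizing on the query plus the candidate set. A minimal sketch, where the scoring function is a placeholder for a real reranker call:

```python
from functools import lru_cache

def _score(query: str, doc: str) -> float:
    # Placeholder for a cross-encoder or API reranker score.
    return len(set(query.lower().split()) & set(doc.lower().split()))

@lru_cache(maxsize=1024)
def cached_rerank(query: str, documents: tuple[str, ...], top_k: int = 5) -> tuple[str, ...]:
    # lru_cache requires hashable arguments, hence tuples instead of lists.
    ranked = sorted(documents, key=lambda doc: _score(query, doc), reverse=True)
    return tuple(ranked[:top_k])

# Repeated identical queries hit the cache instead of rescoring.
docs = ("reset password in settings", "pricing plans", "password rules")
first = cached_rerank("how to reset my password", docs, 2)
second = cached_rerank("how to reset my password", docs, 2)  # served from cache
```

In production you would typically key on a normalized query and stable document IDs, and bound staleness with a TTL, but the shape is the same.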

Summary

Reranking is one of the highest-impact improvements you can make to a RAG pipeline. By adding a cross-encoder or API-based reranker between retrieval and generation, you dramatically improve the precision of your context window. The extra 50-200ms of latency is almost always worth the quality improvement.
