What is Reranking?
Reranking is a two-stage retrieval approach: first, retrieve a broad set of candidate documents using fast, approximate methods (vector similarity, BM25), then use a more accurate but slower model to re-score and reorder them. You get the speed of approximate search over the full corpus, plus the accuracy of an expensive model applied to only a handful of candidates.
The key insight is that bi-encoder models (used for embeddings) process the query and document independently, then compare the resulting vectors. Cross-encoder models (used for reranking) process the query and document together, allowing much richer interaction between them. This cross-attention produces more accurate relevance scores.
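The two-stage flow can be sketched in a few lines. `fast_search` and `rerank_score` below are hypothetical callables standing in for a vector store or BM25 index and a cross-encoder, respectively:

```python
# Minimal two-stage retrieval skeleton (a sketch, not a full implementation).
# `fast_search` and `rerank_score` are assumed stand-ins for real components.

def two_stage_search(query, fast_search, rerank_score, fetch_k=50, top_k=5):
    # Stage 1: cheap, approximate retrieval over the whole corpus
    candidates = fast_search(query, k=fetch_k)
    # Stage 2: expensive, accurate scoring over just the candidates
    scored = [(doc, rerank_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The cost asymmetry is the point: stage 1 touches millions of documents cheaply, stage 2 runs the expensive model only `fetch_k` times.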
Bi-Encoder vs Cross-Encoder
| Property | Bi-Encoder (Retrieval) | Cross-Encoder (Reranking) |
|---|---|---|
| Input | Query and doc separately | Query + doc together |
| Output | Independent vectors | Single relevance score |
| Speed | Very fast (precomputed) | Slow (runs per pair) |
| Accuracy | Good | Excellent |
| Use Case | Initial retrieval (top 100) | Reranking (top 100 to top 5) |
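The table above can be made concrete with a toy sketch. `embed()` and `cross_score()` are deliberately trivial stand-ins (character frequencies and word overlap), not real models; the point is the shape of the computation — bi-encoders let you precompute document vectors once offline, while a cross-encoder must run per (query, document) pair at query time:

```python
# Toy illustration of the bi-encoder vs cross-encoder split.
# embed() and cross_score() are hypothetical stand-ins for neural encoders.

def embed(text: str) -> list[float]:
    """Stand-in bi-encoder: normalized character counts (real systems use a model)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord("a")] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cross_score(query: str, doc: str) -> float:
    """Stand-in cross-encoder: scores query and doc *together* (word overlap here)."""
    q_words, d_words = set(query.lower().split()), set(doc.lower().split())
    return len(q_words & d_words) / max(len(q_words), 1)

docs = ["reset your password in settings", "pricing starts at $9.99"]

# Bi-encoder: document vectors precomputed once; queries compared by dot product
doc_vecs = [embed(d) for d in docs]          # offline, once per corpus
q_vec = embed("how do i reset my password")  # at query time
bi_scores = [sum(a * b for a, b in zip(q_vec, v)) for v in doc_vecs]

# Cross-encoder: the scoring function runs once per (query, document) pair
ce_scores = [cross_score("how do i reset my password", d) for d in docs]
```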
Cross-Encoder Reranking
```python
# Cross-encoder reranking with sentence-transformers
from sentence_transformers import CrossEncoder

# Load a cross-encoder trained on the MS MARCO passage-ranking dataset
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_documents(
    query: str,
    documents: list[str],
    top_k: int = 5,
) -> list[tuple[str, float]]:
    """Rerank documents using a cross-encoder."""
    # Create query-document pairs
    pairs = [(query, doc) for doc in documents]
    # Score all pairs in one batch
    scores = reranker.predict(pairs)
    # Sort by score (descending)
    scored_docs = list(zip(documents, scores))
    scored_docs.sort(key=lambda x: x[1], reverse=True)
    return scored_docs[:top_k]

# Example: initial retrieval returned these candidates; reranking picks the top 3
candidates = [
    "To reset your password, go to Settings > Security",
    "Password requirements: 8 chars, 1 uppercase, 1 number",
    "Account recovery options include email and phone",
    "Two-factor authentication adds an extra security layer",
    "Contact support for locked account assistance",
    "Our pricing plans start at $9.99/month",
    "API documentation is available at docs.example.com",
]

reranked = rerank_documents(
    query="How do I change my password?",
    documents=candidates,
    top_k=3,
)
for doc, score in reranked:
    print(f"[{score:.4f}] {doc}")
```
Cohere Rerank API
Cohere offers a managed reranking API that's simple to integrate and performs well across domains.
```typescript
// Cohere Rerank API
import { CohereClient } from "cohere-ai";

const cohere = new CohereClient({
  token: process.env.COHERE_API_KEY!,
});

async function cohereRerank(
  query: string,
  documents: string[],
  topK: number = 5
): Promise<{ text: string; score: number; index: number }[]> {
  const response = await cohere.rerank({
    model: "rerank-english-v3.0",
    query,
    documents,
    topN: topK,
    returnDocuments: true,
  });
  return response.results.map((r) => ({
    text: documents[r.index],
    score: r.relevanceScore,
    index: r.index,
  }));
}

// Integration with a RAG pipeline.
// Assumes `vectorStore` and `anthropic` clients are initialized elsewhere.
async function ragWithReranking(question: string): Promise<string> {
  // Step 1: Retrieve a broad set of candidates
  const candidates = await vectorStore.similaritySearch(question, 20);

  // Step 2: Rerank with Cohere
  const reranked = await cohereRerank(
    question,
    candidates.map((c) => c.pageContent),
    5
  );

  // Step 3: Use the top reranked results as context
  const context = reranked.map((r) => r.text).join("\n\n");

  // Step 4: Generate the answer
  const response = await anthropic.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: `Answer based on context:\n${context}`,
    messages: [{ role: "user", content: question }],
  });
  return response.content[0].type === "text" ? response.content[0].text : "";
}
```
```python
# Cohere Rerank in Python
import cohere

co = cohere.Client()

def rerank_with_cohere(
    query: str,
    documents: list[str],
    top_n: int = 5,
) -> list[dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
        return_documents=True,
    )
    return [
        {
            "text": r.document.text,
            "score": r.relevance_score,
            "original_index": r.index,
        }
        for r in results.results
    ]

# Usage
docs = ["doc1...", "doc2...", "doc3..."]
reranked = rerank_with_cohere("my query", docs)
for r in reranked:
    print(f"[{r['score']:.4f}] {r['text'][:80]}...")
```
LLM-Based Reranking
You can also use an LLM itself to rerank results, especially when you need domain-specific relevance judgments.
```typescript
// LLM-based reranking with Claude.
// Assumes `client` is an initialized Anthropic SDK client.
async function llmRerank(
  query: string,
  documents: { id: string; content: string }[],
  topK: number = 5
): Promise<string[]> {
  const docList = documents
    .map((d, i) => `[Doc ${i + 1}] ${d.content.slice(0, 200)}`)
    .join("\n\n");

  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 256,
    messages: [
      {
        role: "user",
        content: `Given the query: "${query}"

Rank these documents by relevance. Return ONLY the document numbers in order of relevance, most relevant first.
Format: 3, 1, 5, 2, 4

Documents:
${docList}`,
      },
    ],
  });

  const text =
    response.content[0].type === "text" ? response.content[0].text : "";
  const indices = text.match(/\d+/g)?.map(Number) || [];
  return indices
    .filter((i) => i >= 1 && i <= documents.length)
    .slice(0, topK)
    .map((i) => documents[i - 1].content);
}
```
Reranking Best Practices
- Over-retrieve, then rerank: Fetch 20-50 candidates and rerank down to the top 3-5. More candidates give the reranker more chances to surface the right document, at only linear scoring cost.
- Cross-encoder models are small: Local cross-encoders like MiniLM can score dozens of pairs in tens of milliseconds on modest hardware, so the added latency is rarely a blocker.
- Cohere Rerank is production-ready: If you need a managed API, Cohere provides excellent quality with simple integration.
- Combine with hybrid search: Hybrid retrieval + reranking is the gold standard for RAG quality in 2026.
- Cache reranking results: If the same query is asked often, cache the reranked order to avoid repeated computation.
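The caching advice above can be sketched as a thin memoization wrapper. `RerankCache` is a hypothetical in-memory helper, not a library API; a production system might back it with Redis and a TTL instead:

```python
# Sketch of a reranking cache keyed by (query, candidate list, options).
# Works with any reranker callable: cross-encoder, Cohere, or LLM-based.
import hashlib
import json

class RerankCache:
    """Memoize reranker calls so repeated queries skip the expensive model."""

    def __init__(self, rerank_fn):
        self.rerank_fn = rerank_fn
        self._cache: dict[str, object] = {}

    def _key(self, query: str, documents: list[str], kwargs: dict) -> str:
        # Hash the full input so different candidate sets or options never collide
        payload = json.dumps(
            {"q": query, "docs": documents, "kw": kwargs},
            sort_keys=True,
            default=str,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def rerank(self, query: str, documents: list[str], **kwargs):
        key = self._key(query, documents, kwargs)
        if key not in self._cache:  # only call the reranker on a cache miss
            self._cache[key] = self.rerank_fn(query, documents, **kwargs)
        return self._cache[key]
```

Wrapping `rerank_documents` from earlier would look like `cache = RerankCache(rerank_documents)`, then calling `cache.rerank(query, docs)` everywhere the raw function was used.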
Summary
Reranking is one of the highest-impact improvements you can make to a RAG pipeline. By adding a cross-encoder or API-based reranker between retrieval and generation, you dramatically improve the precision of your context window. The extra 50-200ms of latency is almost always worth the quality improvement.