Beyond Basic Similarity Search
The default RAG retrieval strategy — embed a query, find the K nearest vectors — works for simple cases but often falls short in production. Documents may be relevant but not surfaced, results may be redundant, or the semantic gap between questions and answers may fool simple cosine similarity. Advanced retrieval strategies address these challenges.
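For reference, the baseline that every strategy below builds on is plain top-K cosine similarity. The following is a minimal sketch, not a library implementation; the all-MiniLM-L6-v2 embedder is an assumption that matches the model used later in this lesson.
# Baseline: plain top-K cosine similarity (illustrative sketch)
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def naive_similarity_search(query: str, documents: list[str], k: int = 5):
    """Embed the query, rank every document by cosine similarity, return the top K."""
    doc_embeddings = embedder.encode(documents)
    query_embedding = embedder.encode(query)
    sims = doc_embeddings @ query_embedding / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    top_k = np.argsort(sims)[::-1][:k]
    return [(int(i), documents[i], float(sims[i])) for i in top_k]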
Retrieval Strategy Spectrum
- Naive Similarity: Simple cosine similarity — fast but may miss relevant results
- MMR (Maximal Marginal Relevance): Balances relevance with diversity in results
- Hybrid Search: Combines semantic vectors with keyword matching (BM25); a rough sketch follows this list
- Multi-Query: Generates multiple query variations for broader retrieval
- HyDE: Generates a hypothetical answer, then searches for similar documents
- Reranking: Uses a cross-encoder model to re-score and reorder initial results
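Hybrid search is not revisited later in this lesson, so here is a rough sketch of the idea: score documents with BM25 and with cosine similarity, normalize both score sets, and blend them. The rank_bm25 package, the min-max normalization, and the 50/50 weighting are assumptions for illustration, not part of this lesson's stack.
# Hybrid search sketch: blend BM25 keyword scores with vector similarity
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hybrid_search(query: str, documents: list[str], k: int = 5, alpha: float = 0.5):
    """Combine keyword (BM25) and semantic (cosine) scores; alpha weights the semantic side."""
    # Keyword scores over whitespace-tokenized documents
    bm25 = BM25Okapi([doc.lower().split() for doc in documents])
    keyword_scores = np.array(bm25.get_scores(query.lower().split()))
    # Semantic scores via cosine similarity
    doc_embeddings = embedder.encode(documents)
    query_embedding = embedder.encode(query)
    semantic_scores = doc_embeddings @ query_embedding / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    # Min-max normalize each score set so they are comparable, then blend
    def normalize(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-9)
    combined = alpha * normalize(semantic_scores) + (1 - alpha) * normalize(keyword_scores)
    top_k = np.argsort(combined)[::-1][:k]
    return [(int(i), documents[i], float(combined[i])) for i in top_k]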
Maximal Marginal Relevance (MMR)
MMR balances relevance (how closely a document matches the query) with diversity (how different it is from already-selected documents). This prevents the top-K results from being near-duplicates that all say the same thing.
# MMR implementation
import numpy as np
from typing import List, Tuple

def maximal_marginal_relevance(
    query_embedding: np.ndarray,
    doc_embeddings: np.ndarray,
    documents: List[str],
    k: int = 5,
    lambda_param: float = 0.5,  # 0 = max diversity, 1 = max relevance
) -> List[Tuple[int, str, float]]:
    """Select documents balancing relevance and diversity."""
    # Calculate query-document similarities
    query_sims = np.dot(doc_embeddings, query_embedding) / (
        np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
    )
    selected = []
    remaining = list(range(len(documents)))
    for _ in range(min(k, len(documents))):
        best_score = -float("inf")
        best_idx = -1
        for idx in remaining:
            # Relevance to query
            relevance = query_sims[idx]
            # Max similarity to already selected documents
            if selected:
                selected_embeddings = doc_embeddings[selected]
                doc_sims = np.dot(selected_embeddings, doc_embeddings[idx]) / (
                    np.linalg.norm(selected_embeddings, axis=1)
                    * np.linalg.norm(doc_embeddings[idx])
                )
                max_sim_to_selected = np.max(doc_sims)
            else:
                max_sim_to_selected = 0.0
            # MMR score: trade off relevance against redundancy with already-selected docs
            mmr_score = lambda_param * relevance - (1 - lambda_param) * max_sim_to_selected
            if mmr_score > best_score:
                best_score = mmr_score
                best_idx = idx
        selected.append(best_idx)
        remaining.remove(best_idx)
    return [(i, documents[i], query_sims[i]) for i in selected]
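To try the hand-rolled function directly, one possible setup looks like this. The toy corpus is purely illustrative, and the embedder is the same all-MiniLM-L6-v2 model used later in this lesson.
# Example usage of the hand-rolled MMR (illustrative sketch)
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
documents = [
    "REST APIs should use consistent, resource-oriented URLs.",
    "Use nouns for resources and HTTP verbs for actions in REST APIs.",
    "Version your API in the URL or in an Accept header.",
    "Rate limiting protects an API from abusive clients.",
]
doc_embeddings = embedder.encode(documents)
query_embedding = embedder.encode("What are the best practices for API design?")

results = maximal_marginal_relevance(
    query_embedding, doc_embeddings, documents, k=2, lambda_param=0.7
)
for idx, text, relevance in results:
    print(f"{relevance:.3f}  {text}")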
# Using LangChain's built-in MMR
from langchain_chroma import Chroma
from langchain_huggingface import HuggingFaceEmbeddings

# Any embedding model works here; this one matches the model used elsewhere in this lesson
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(persist_directory="./db", embedding_function=embeddings)

# MMR retrieval
retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,
        "fetch_k": 20,  # Fetch 20 candidates, select 5 diverse ones
        "lambda_mult": 0.5,  # Balance relevance and diversity
    },
)
docs = retriever.invoke("What are the best practices for API design?")
Multi-Query Retrieval
A single query may not capture every aspect of the user's question. Multi-query retrieval uses the LLM to generate several variations of the query, retrieves documents for each variation, and merges them for broader coverage.
// Multi-query retrieval
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();

async function multiQueryRetrieval(
  question: string,
  vectorStore: any,
  numQueries: number = 3
): Promise<string[]> {
  // Step 1: Generate query variations
  const response = await client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 512,
    messages: [
      {
        role: "user",
        content: `Generate ${numQueries} different search queries to find information relevant to this question. Each query should approach the question from a different angle.
Question: ${question}
Return ONLY the queries, one per line, no numbering.`,
      },
    ],
  });

  const queries = (response.content[0] as any).text
    .split("\n")
    .filter((q: string) => q.trim());

  // Step 2: Retrieve for each query (including the original question)
  const allDocs = new Map<string, any>();
  for (const query of [question, ...queries]) {
    const results = await vectorStore.similaritySearch(query, 3);
    for (const doc of results) {
      const key = doc.pageContent.slice(0, 100); // Dedup key
      if (!allDocs.has(key)) {
        allDocs.set(key, doc);
      }
    }
  }

  // Step 3: Return unique documents
  return Array.from(allDocs.values()).map(d => d.pageContent);
}
Hypothetical Document Embeddings (HyDE)
HyDE generates a hypothetical answer to the question, embeds that answer, and uses the embedding to search. This works because the hypothetical answer is likely to use language and terminology similar to the actual documents, bridging the semantic gap between questions and answers.
# HyDE - Hypothetical Document Embeddings
import anthropic
from sentence_transformers import SentenceTransformer

client = anthropic.Anthropic()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def hyde_retrieval(question: str, collection, n_results: int = 5):
    # Step 1: Generate a hypothetical answer
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"""Write a short paragraph that would be the perfect answer
to this question, as if it appeared in a document:
Question: {question}
Write ONLY the answer paragraph, no preamble.""",
        }],
    )
    hypothetical_answer = response.content[0].text

    # Step 2: Embed the hypothetical answer (not the question!)
    hyde_embedding = embedder.encode(hypothetical_answer).tolist()

    # Step 3: Search with the hypothetical answer embedding
    results = collection.query(
        query_embeddings=[hyde_embedding],
        n_results=n_results,
    )
    return results
# Comparison: standard vs HyDE
standard_results = collection.query(
    query_texts=["What causes memory leaks in Node.js?"],
    n_results=5,
)
hyde_results = hyde_retrieval(
    "What causes memory leaks in Node.js?",
    collection,
)
Contextual Compression
Retrieved chunks often contain material that has nothing to do with the question. Contextual compression passes each retrieved document back through the LLM to extract only the passages relevant to the question, reducing noise and token usage before generation.
// Contextual compression - extract only relevant parts from retrieved docs
async function compressContext(
  question: string,
  documents: string[],
): Promise<string[]> {
  const compressed: string[] = [];

  for (const doc of documents) {
    const response = await client.messages.create({
      model: "claude-sonnet-4-20250514",
      max_tokens: 256,
      messages: [
        {
          role: "user",
          content: `Given this question: "${question}"
Extract ONLY the parts of this document that are relevant to answering the question.
If nothing is relevant, respond with "NOT_RELEVANT".
Document:
${doc}`,
        },
      ],
    });

    const result = (response.content[0] as any).text;
    if (!result.includes("NOT_RELEVANT")) {
      compressed.push(result);
    }
  }

  return compressed;
}
Strategy Comparison
| Strategy | Latency | Cost | Best For |
|---|---|---|---|
| Naive Similarity | Low | Low | Simple QA |
| MMR | Low | Low | Avoiding duplicate results |
| Multi-Query | Medium | Medium | Complex questions |
| HyDE | Medium | Medium | Q-to-A semantic gap |
| Compression | High | High | Long documents, noise reduction |
Summary
Start with basic similarity search, then layer on additional strategies based on your quality requirements. MMR adds negligible latency and cost while improving diversity. Multi-query and HyDE add moderate cost but significantly improve retrieval for complex questions. Combine these with reranking (covered in a dedicated lesson) for the best results. Always measure retrieval quality with real queries before and after applying each strategy.
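As one concrete way to layer strategies, the sketch below chains HyDE with the hand-rolled MMR selection defined earlier. It reuses the client and embedder from the HyDE section; the lambda value of 0.7 is an arbitrary illustrative choice, not a recommendation.
# Layering strategies: HyDE to bridge the question/answer gap, then MMR for diversity
# (sketch reusing maximal_marginal_relevance, client, and embedder defined above)
def hyde_then_mmr(question: str, documents: list[str], k: int = 5):
    # Step 1: Generate a hypothetical answer and embed it, as in the HyDE section
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": f"Write a short paragraph that would be the perfect answer to: {question}\nWrite ONLY the answer paragraph, no preamble.",
        }],
    )
    hyde_embedding = embedder.encode(response.content[0].text)

    # Step 2: Select a relevant but diverse top-K with MMR, scored against the HyDE embedding
    doc_embeddings = embedder.encode(documents)
    return maximal_marginal_relevance(
        hyde_embedding, doc_embeddings, documents, k=k, lambda_param=0.7
    )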