Why Evaluate RAG?
Building a RAG system is only half the battle — you need to know how well it works. Without systematic evaluation, you're flying blind: you won't know if your chunking is too aggressive, your retrieval is missing relevant documents, or your LLM is hallucinating despite having good context. RAG evaluation provides quantitative metrics to guide optimization.
Key RAG Metrics
- Faithfulness: Does the answer only use information from the retrieved context? (No hallucination)
- Answer Relevancy: Is the answer actually relevant to the question asked?
- Context Precision: Are the retrieved documents relevant to the question?
- Context Recall: Did retrieval find all the relevant documents?
- Answer Correctness: Is the answer factually correct? (Requires ground truth)
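Conceptually, most of these metrics are simple ratios; the hard part, deciding which claims are supported and which chunks are relevant, is delegated to an LLM judge in practice. A minimal sketch of the underlying arithmetic (the counting functions here are illustrative, not from any framework):
# Metric definitions as ratios; the counts come from an LLM judge in practice
def faithfulness_score(supported_claims: int, total_claims: int) -> float:
    """Fraction of answer claims supported by the retrieved context."""
    return supported_claims / total_claims if total_claims else 0.0

def context_precision_score(relevant_retrieved: int, total_retrieved: int) -> float:
    """Fraction of retrieved chunks that are relevant to the question."""
    return relevant_retrieved / total_retrieved if total_retrieved else 0.0

def context_recall_score(relevant_retrieved: int, total_relevant: int) -> float:
    """Fraction of all relevant chunks that retrieval actually found."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0

# e.g. 9 of 10 answer claims grounded in context -> faithfulness of 0.9
assert faithfulness_score(9, 10) == 0.9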
RAGAS Framework
RAGAS (Retrieval Augmented Generation Assessment) is a widely used open-source framework for evaluating RAG pipelines. Its metrics are scored automatically with an LLM as the judge: faithfulness and answer relevancy need no human labels, while context precision and context recall are computed against a ground-truth reference.
# RAG evaluation with RAGAS
# Note: evaluate() judges with an LLM; by default RAGAS expects an
# OpenAI key (OPENAI_API_KEY), or pass your own model via llm=
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_precision,
context_recall,
)
from datasets import Dataset
# Prepare evaluation dataset
# Each sample needs: question, answer, contexts, ground_truth
eval_data = {
"question": [
"What is our vacation policy?",
"How do I reset my password?",
"What are the API rate limits?",
],
"answer": [
"The company offers 20 days of paid vacation per year for full-time employees.",
"Go to Settings > Security > Change Password to reset your password.",
"Free tier: 100 req/min. Premium: 10,000 req/min.",
],
"contexts": [
["Full-time employees receive 20 days of paid time off annually. Part-time employees receive 10 days."],
["To change your password: navigate to Settings, then Security, then Change Password. Enter current and new password."],
["API rate limits: Free tier allows 100 requests per minute. Premium tier allows 10,000 requests per minute with burst to 15,000."],
],
"ground_truth": [
"Full-time employees get 20 days PTO per year.",
"Navigate to Settings > Security > Change Password.",
"Free: 100 req/min, Premium: 10,000 req/min.",
],
}
dataset = Dataset.from_dict(eval_data)
# Run evaluation
results = evaluate(
dataset=dataset,
metrics=[
faithfulness,
answer_relevancy,
context_precision,
context_recall,
],
)
print(results)
# {'faithfulness': 0.95, 'answer_relevancy': 0.92,
# 'context_precision': 0.88, 'context_recall': 0.90}
# Get per-sample scores
df = results.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])
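The per-sample frame is where debugging happens: filtering it to low scorers points straight at the failing samples. A short follow-up, assuming the df from above and a 0.8 cutoff chosen for illustration:
# Inspect samples the judge flagged as potentially unfaithful
failing = df[df["faithfulness"] < 0.8]
for _, row in failing.iterrows():
    print(row["question"])
    print(row["answer"])  # compare against row["contexts"] to locate the hallucination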
Custom Evaluation Pipeline
For more control, build your own evaluation pipeline using an LLM as a judge.
// Custom RAG evaluation with LLM-as-judge
import Anthropic from "@anthropic-ai/sdk";
const client = new Anthropic();
interface EvalSample {
question: string;
answer: string;
contexts: string[];
groundTruth?: string;
}
interface EvalScores {
faithfulness: number;
relevancy: number;
completeness: number;
overall: number;
}
async function evaluateRAGSample(sample: EvalSample): Promise<EvalScores> {
const response = await client.messages.create({
model: "claude-sonnet-4-20250514",
max_tokens: 512,
messages: [
{
role: "user",
content: `Evaluate this RAG system output. Score each metric from 0.0 to 1.0.
Question: ${sample.question}
Retrieved Context:
${sample.contexts.join("\n---\n")}
Generated Answer: ${sample.answer}
${sample.groundTruth ? `Ground Truth: ${sample.groundTruth}` : ""}
Score these metrics (0.0 to 1.0):
1. Faithfulness: Does the answer ONLY use info from the context? (1.0 = no hallucination)
2. Relevancy: Does the answer address the question? (1.0 = perfectly relevant)
3. Completeness: Does the answer cover all aspects of the question? (1.0 = complete)
Respond ONLY in JSON format:
{"faithfulness": 0.X, "relevancy": 0.X, "completeness": 0.X}`,
},
],
});
  const text = response.content[0].type === "text" ? response.content[0].text : "{}";
  // Grab the JSON object in case the model wraps it in extra prose
  const jsonMatch = text.match(/\{[\s\S]*\}/);
  const scores = JSON.parse(jsonMatch ? jsonMatch[0] : "{}");
return {
...scores,
overall: (scores.faithfulness + scores.relevancy + scores.completeness) / 3,
};
}
// Batch evaluation
async function evaluateRAGPipeline(
samples: EvalSample[]
): Promise<{ averages: EvalScores; details: EvalScores[] }> {
const details = await Promise.all(samples.map(evaluateRAGSample));
const averages: EvalScores = {
faithfulness: details.reduce((s, d) => s + d.faithfulness, 0) / details.length,
relevancy: details.reduce((s, d) => s + d.relevancy, 0) / details.length,
completeness: details.reduce((s, d) => s + d.completeness, 0) / details.length,
overall: details.reduce((s, d) => s + d.overall, 0) / details.length,
};
return { averages, details };
}
// Run evaluation
const results = await evaluateRAGPipeline([
{
question: "What is the refund policy?",
answer: "Refunds are available within 30 days of purchase.",
contexts: ["Our refund policy: full refund within 30 days, partial refund within 60 days."],
groundTruth: "Full refund within 30 days, partial refund within 60 days.",
},
]);
console.log("Average scores:", results.averages);
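One design note: Promise.all sends every judge request concurrently, which is fine for a handful of samples but will run into API rate limits on larger eval sets. Once your dataset grows past a few dozen samples, batch or throttle the calls.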
Retrieval-Specific Metrics
If you have ground-truth relevance labels for retrieved documents, classic information-retrieval metrics measure retrieval quality directly, independent of the generation step.
# Retrieval quality metrics
import numpy as np
def precision_at_k(relevant: list[bool], k: int) -> float:
"""What fraction of the top-k results are relevant?"""
return sum(relevant[:k]) / k
def recall_at_k(relevant: list[bool], total_relevant: int, k: int) -> float:
    """What fraction of all relevant docs appear in the top-k?"""
    return sum(relevant[:k]) / total_relevant if total_relevant > 0 else 0.0
def mean_reciprocal_rank(relevant: list[bool]) -> float:
    """Reciprocal of the rank of the first relevant result (higher = better).
    This scores a single query; average across queries to get MRR."""
    for i, is_relevant in enumerate(relevant):
        if is_relevant:
            return 1 / (i + 1)
    return 0.0
def ndcg_at_k(relevance_scores: list[float], k: int) -> float:
"""Normalized Discounted Cumulative Gain."""
dcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(relevance_scores[:k]))
ideal = sorted(relevance_scores, reverse=True)[:k]
idcg = sum(rel / np.log2(i + 2) for i, rel in enumerate(ideal))
return dcg / idcg if idcg > 0 else 0
# Example evaluation
retrieved_relevant = [True, False, True, True, False] # Which of top-5 are relevant
total_relevant_docs = 4 # Total relevant docs in corpus
print(f"Precision@5: {precision_at_k(retrieved_relevant, 5):.2f}") # 0.60
print(f"Recall@5: {recall_at_k(retrieved_relevant, total_relevant_docs, 5):.2f}") # 0.75
print(f"MRR: {mean_reciprocal_rank(retrieved_relevant):.2f}") # 1.00
Evaluation Checklist
| Metric | Target | Action if Low |
|---|---|---|
| Faithfulness | > 0.90 | Improve system prompt, reduce context noise |
| Relevancy | > 0.85 | Better chunking, add reranking |
| Context Precision | > 0.80 | Improve embeddings, add filtering |
| Context Recall | > 0.80 | Increase top-K, try multi-query retrieval |
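These targets can double as automated regression gates, which pairs naturally with running evals in CI (see the best practices below). A minimal sketch, assuming a dict of averaged scores shaped like the RAGAS output earlier; the thresholds mirror the table:
# Fail the build when any averaged metric misses its target
TARGETS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.85,
    "context_precision": 0.80,
    "context_recall": 0.80,
}

def check_targets(scores: dict[str, float]) -> list[str]:
    """Return a message for every metric below its target."""
    return [
        f"{name}: {scores[name]:.2f} < {target:.2f}"
        for name, target in TARGETS.items()
        if scores.get(name, 0.0) < target
    ]

failures = check_targets({"faithfulness": 0.95, "answer_relevancy": 0.92,
                          "context_precision": 0.78, "context_recall": 0.90})
if failures:
    raise SystemExit("RAG eval regression: " + "; ".join(failures))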
Evaluation Best Practices
- Build an eval dataset early: Create 50-100 question-answer pairs covering your key use cases
- Run evals on every change: Treat RAG eval like unit tests — run them in CI/CD after every pipeline change
- Use LLM-as-judge for scale: Human evaluation is gold standard but doesn't scale — use LLM judges with spot-check human review
- Track metrics over time: Monitor evaluation scores to catch regressions early
- Test edge cases: Include questions that should NOT be answered (out of scope) to test guardrails; a sketch of one such check follows this list
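A rough way to encode such a guardrail check; the refusal markers below are illustrative assumptions, not a standard list:
# Hypothetical guardrail check: for out-of-scope questions, the correct
# behavior is a refusal, not a fabricated answer
REFUSAL_MARKERS = ("don't have", "cannot answer", "not covered")

def is_refusal(answer: str) -> bool:
    """True if the answer declines rather than inventing information."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# An unanswerable question should produce a refusal
assert is_refusal("I don't have information about that in the provided documents.")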
Summary
RAG evaluation is essential for building reliable AI applications. RAGAS provides a solid framework with automated metrics. For custom needs, build an LLM-as-judge pipeline. The key metrics are faithfulness (no hallucination), relevancy (answers the question), and context precision/recall (retrieval quality). Establish baselines early and evaluate continuously.