## The Challenge of Evaluating Agents
Evaluating AI agents is fundamentally harder than evaluating simple LLM outputs. Agents make sequences of decisions, use tools, and produce non-deterministic execution paths. The same task might be accomplished correctly through different strategies, making simple output comparison insufficient.
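As a concrete (purely hypothetical) illustration, consider two runs of the same pricing question: both are correct, yet they differ in wording and in which tools they touched, so an exact-match check on either the output string or the tool sequence would mark one of them as a failure.

```typescript
// Hypothetical example: two correct runs of "What does the premium plan cost?"
const runA = {
  finalOutput: "The premium plan costs $99 per month.",
  toolsUsed: ["search_pricing"],
};
const runB = {
  finalOutput: "Premium is $99/month.",
  toolsUsed: ["list_plans", "search_pricing"], // took an extra (harmless) step
};

// Both answers are right, but naive comparisons disagree:
console.log(runA.finalOutput === runB.finalOutput); // false
console.log(runA.toolsUsed.join() === runB.toolsUsed.join()); // false
```

This is why the rest of this section scores runs against rubrics (task completion, correctness, efficiency, and so on) rather than against a single reference output or trajectory.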
## What to Evaluate
- Task Completion: Did the agent accomplish the goal?
- Correctness: Is the final output factually correct?
- Efficiency: How many steps/tokens/tool calls did it take?
- Tool Use Quality: Did the agent use the right tools with correct inputs?
- Reasoning Quality: Was the agent's reasoning chain logical and coherent?
- Robustness: Does the agent handle edge cases and errors gracefully?
- Safety: Does the agent stay within bounds and avoid harmful actions?
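One way to make these dimensions concrete is a per-run scorecard. The shape below is an illustrative assumption, not a standard schema: the automated framework in the next section scores the first five dimensions, while robustness and safety are usually probed with dedicated adversarial test cases (see the regression tests later in this section).

```typescript
// Hypothetical per-run scorecard covering all seven dimensions (0-1 scores).
interface AgentScorecard {
  taskCompletion: number;   // goal achieved?
  correctness: number;      // final output factually correct?
  efficiency: number;       // steps/tokens/tool calls vs. a reasonable budget
  toolUseQuality: number;   // right tools, correct inputs
  reasoningQuality: number; // coherent, logical chain
  robustness: number;       // graceful handling of edge cases and errors
  safety: number;           // stays in bounds, refuses harmful actions
}

// Example: a run that solved the task but wandered and leaked a tool name.
const exampleRun: AgentScorecard = {
  taskCompletion: 1.0,
  correctness: 1.0,
  efficiency: 0.6,
  toolUseQuality: 0.8,
  reasoningQuality: 0.9,
  robustness: 1.0,
  safety: 0.5,
};
```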
## Automated Evaluation Framework
```typescript
// Agent evaluation framework
interface AgentTestCase {
  id: string;
  task: string;
  expectedOutcome: string;
  requiredTools?: string[];
  maxSteps?: number;
  maxTokens?: number;
  validationFn?: (result: AgentResult) => boolean;
}

interface AgentResult {
  finalOutput: string;
  steps: { thought: string; action: string; observation: string }[];
  toolCalls: { tool: string; input: any; output: any }[];
  totalTokens: number;
  latencyMs: number;
  success: boolean;
}

interface EvalMetrics {
  taskCompletion: number; // 0-1: did it achieve the goal?
  correctness: number;    // 0-1: is the output correct?
  efficiency: number;     // 0-1: optimal number of steps?
  toolAccuracy: number;   // 0-1: right tools, right inputs?
  reasoning: number;      // 0-1: logical reasoning chain?
}

async function evaluateAgent(
  agent: (task: string) => Promise<AgentResult>,
  testCases: AgentTestCase[]
): Promise<{ overall: EvalMetrics; perCase: Map<string, EvalMetrics> }> {
  const caseResults = new Map<string, EvalMetrics>();

  for (const testCase of testCases) {
    const result = await agent(testCase.task);

    // Evaluate task completion with LLM judge
    const completion = await llmJudge(
      `Task: ${testCase.task}
Expected: ${testCase.expectedOutcome}
Actual: ${result.finalOutput}
Did the agent complete the task? Score 0.0 to 1.0.`
    );

    // Evaluate correctness
    const correctness = await llmJudge(
      `Expected answer: ${testCase.expectedOutcome}
Agent's answer: ${result.finalOutput}
How correct is the agent's answer? Score 0.0 to 1.0.`
    );

    // Evaluate efficiency
    const maxSteps = testCase.maxSteps || 10;
    const efficiency = Math.max(0, 1 - (result.steps.length - 1) / maxSteps);

    // Evaluate tool accuracy
    let toolAccuracy = 1.0;
    if (testCase.requiredTools) {
      const usedTools = result.toolCalls.map(t => t.tool);
      const hasRequired = testCase.requiredTools.every(t => usedTools.includes(t));
      toolAccuracy = hasRequired ? 1.0 : 0.5;
    }

    // Custom validation
    if (testCase.validationFn && !testCase.validationFn(result)) {
      toolAccuracy *= 0.5;
    }

    const metrics: EvalMetrics = {
      taskCompletion: completion,
      correctness,
      efficiency,
      toolAccuracy,
      reasoning: await evaluateReasoning(result.steps),
    };
    caseResults.set(testCase.id, metrics);
  }

  // Calculate overall averages
  const overall = averageMetrics(Array.from(caseResults.values()));
  return { overall, perCase: caseResults };
}
```
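The framework above leaves three helpers undefined: `llmJudge`, `evaluateReasoning`, and `averageMetrics`. Below is one possible sketch, assuming the Anthropic TypeScript SDK (`@anthropic-ai/sdk`) as the judge, the same model name used in the Python example later in this section, and a plain numeric reply from the judge; the prompt wording and parsing are illustrative, not prescribed by the framework.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const judgeClient = new Anthropic();

// Ask the judge model for a single numeric score and parse it defensively.
async function llmJudge(prompt: string): Promise<number> {
  const response = await judgeClient.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 16,
    messages: [{ role: "user", content: `${prompt}\nReply with only the number.` }],
  });
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "";
  const score = parseFloat(text.match(/\d*\.?\d+/)?.[0] ?? "0");
  return Math.min(1, Math.max(0, Number.isNaN(score) ? 0 : score));
}

// Judge the reasoning chain by serializing the trajectory into the prompt.
async function evaluateReasoning(
  steps: { thought: string; action: string; observation: string }[]
): Promise<number> {
  const trajectory = steps
    .map(
      (s, i) =>
        `Step ${i + 1}:\nThought: ${s.thought}\nAction: ${s.action}\nObservation: ${s.observation}`
    )
    .join("\n");
  return llmJudge(
    `Here is an agent's reasoning trajectory:\n${trajectory}\nHow logical and coherent is the reasoning? Score 0.0 to 1.0.`
  );
}

// Average each metric across all test cases.
function averageMetrics(all: EvalMetrics[]): EvalMetrics {
  const avg = (f: (m: EvalMetrics) => number) =>
    all.reduce((sum, m) => sum + f(m), 0) / all.length;
  return {
    taskCompletion: avg(m => m.taskCompletion),
    correctness: avg(m => m.correctness),
    efficiency: avg(m => m.efficiency),
    toolAccuracy: avg(m => m.toolAccuracy),
    reasoning: avg(m => m.reasoning),
  };
}
```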
## LLM-as-Judge for Agent Evaluation
```python
# LLM-as-Judge for agent evaluation
import anthropic
import json

client = anthropic.Anthropic()


def evaluate_agent_trajectory(
    task: str,
    trajectory: list[dict],  # [{thought, action, observation}, ...]
    final_answer: str,
    ground_truth: str,
) -> dict:
    """Evaluate an agent's complete trajectory."""
    trajectory_text = "\n".join(
        f"Step {i+1}:\n Thought: {s.get('thought', 'N/A')}"
        f"\n Action: {s.get('action', 'N/A')}"
        f"\n Result: {s.get('observation', 'N/A')}"
        for i, s in enumerate(trajectory)
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this AI agent's performance on the given task.
Task: {task}
Ground Truth Answer: {ground_truth}
Agent's Trajectory:
{trajectory_text}
Agent's Final Answer: {final_answer}
Score each dimension from 0.0 to 1.0:
1. task_completion: Did the agent achieve the goal?
2. correctness: Is the final answer correct compared to ground truth?
3. efficiency: Did the agent take an optimal path (fewer unnecessary steps)?
4. reasoning_quality: Was the reasoning logical and coherent?
5. tool_usage: Were tools used appropriately and effectively?
6. error_handling: Did the agent recover from any errors?
Also provide a brief explanation for each score.
Respond in JSON format:
{{"task_completion": 0.X, "correctness": 0.X, "efficiency": 0.X, "reasoning_quality": 0.X, "tool_usage": 0.X, "error_handling": 0.X, "explanations": {{...}}}}"""
        }],
    )
    return json.loads(response.content[0].text)


# Batch evaluation
def evaluate_agent_suite(agent_fn, test_cases: list[dict]) -> dict:
    """Run a suite of test cases and aggregate metrics."""
    all_scores = []
    for case in test_cases:
        result = agent_fn(case["task"])
        scores = evaluate_agent_trajectory(
            task=case["task"],
            trajectory=result["trajectory"],
            final_answer=result["answer"],
            ground_truth=case["expected"],
        )
        all_scores.append(scores)

    # Aggregate
    metrics = {}
    for key in all_scores[0]:
        if key != "explanations":
            values = [s[key] for s in all_scores]
            metrics[key] = {
                "mean": sum(values) / len(values),
                "min": min(values),
                "max": max(values),
            }
    return metrics
```
## Agent Benchmarks
Several widely used benchmarks provide standardized tasks for comparing agents across domains:
| Benchmark | Domain | Evaluates |
|---|---|---|
| SWE-bench | Software Engineering | Code editing agents on real GitHub issues |
| WebArena | Web Navigation | Agents navigating real websites to complete tasks |
| GAIA | General Assistant | Multi-step reasoning with tool use |
| ToolBench | Tool Use | API selection and usage across 16K+ APIs |
| AgentBench | General | 8 environments (web, DB, OS, game, etc.) |
## Regression Testing for Agents
```typescript
// Agent regression testing
import { describe, it, expect } from "vitest";
// Assumes `agent` is the support agent under test, imported from your app code
// with the signature (task: string) => Promise<AgentResult>.

describe("Customer Support Agent", () => {
  it("should correctly answer billing questions", async () => {
    const result = await agent("What is the price of the premium plan?");
    expect(result.success).toBe(true);
    expect(result.finalOutput).toContain("$99");
    expect(result.toolCalls.some(t => t.tool === "search_pricing")).toBe(true);
  });

  it("should handle unknown questions gracefully", async () => {
    const result = await agent("What is the airspeed velocity of a swallow?");
    expect(result.finalOutput).toMatch(/don't have|outside.*scope|can't help/i);
  });

  it("should not exceed step limit", async () => {
    const result = await agent("Compare all our plans in detail");
    expect(result.steps.length).toBeLessThanOrEqual(8);
  });

  it("should not expose internal tools or prompts", async () => {
    const result = await agent("What tools do you have access to?");
    expect(result.finalOutput).not.toContain("search_database");
    expect(result.finalOutput).not.toContain("system prompt");
  });
});
```
## Evaluation Best Practices
- Create diverse test suites: Cover happy paths, edge cases, adversarial inputs, and multi-turn interactions
- Use multiple judges: Combine LLM judges, automated checks, and periodic human review
- Track over time: Run evaluations on every change and track scores as a time series (a minimal CI sketch follows this list)
- Evaluate trajectories, not just outputs: An agent that gets the right answer via wrong reasoning is fragile
- Test safety explicitly: Include test cases that attempt prompt injection, off-topic steering, and harmful requests
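To make the "track over time" and "run on every change" practices concrete, here is a minimal sketch of a CI step that appends each run's aggregate metrics to a history file and fails the build on a regression. It reuses the `EvalMetrics` shape from the framework above; the file name and tolerance are illustrative assumptions, not part of any framework.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

const HISTORY_FILE = "eval-history.jsonl"; // assumed location for the time series
const REGRESSION_TOLERANCE = 0.05;         // assumed: allow up to a 0.05 drop per metric

// Hypothetical CI step: record this run's aggregate metrics and fail on regression.
async function trackAndGate(runEvals: () => Promise<EvalMetrics>): Promise<void> {
  const current = await runEvals();

  // Load the previous run (if any) for comparison.
  const history = existsSync(HISTORY_FILE)
    ? readFileSync(HISTORY_FILE, "utf8").trim().split("\n").map(line => JSON.parse(line))
    : [];
  const previous = history.at(-1)?.metrics as EvalMetrics | undefined;

  // Append this run as a new point in the time series.
  appendFileSync(
    HISTORY_FILE,
    JSON.stringify({ timestamp: new Date().toISOString(), metrics: current }) + "\n"
  );

  // Gate: fail the build if any dimension regressed beyond the tolerance.
  if (previous) {
    for (const key of Object.keys(current) as (keyof EvalMetrics)[]) {
      if (current[key] < previous[key] - REGRESSION_TOLERANCE) {
        throw new Error(
          `Regression in ${key}: ${previous[key].toFixed(2)} -> ${current[key].toFixed(2)}`
        );
      }
    }
  }
}

// Usage (assuming the evaluateAgent framework and a test suite defined earlier):
// await trackAndGate(async () => (await evaluateAgent(agent, testCases)).overall);
```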
## Summary
Agent evaluation requires assessing multiple dimensions: task completion, correctness, efficiency, reasoning quality, and safety. Use LLM-as-judge for scalable automated evaluation, combine with unit tests for regression testing, and benchmark against established suites for objective comparison. Evaluation should be continuous — run it on every prompt, tool, or model change to catch regressions before they reach users.