TechLead
Lesson 23 of 24
5 min read
AI Agents & RAG

Agent Evaluation

Benchmark and evaluate AI agents with metrics, human evaluation, and automated testing frameworks

The Challenge of Evaluating Agents

Evaluating AI agents is fundamentally harder than evaluating simple LLM outputs. Agents make sequences of decisions, use tools, and produce non-deterministic execution paths. The same task might be accomplished correctly through different strategies, making simple output comparison insufficient.

What to Evaluate

  • Task Completion: Did the agent accomplish the goal?
  • Correctness: Is the final output factually correct?
  • Efficiency: How many steps/tokens/tool calls did it take?
  • Tool Use Quality: Did the agent use the right tools with correct inputs?
  • Reasoning Quality: Was the agent's reasoning chain logical and coherent?
  • Robustness: Does the agent handle edge cases and errors gracefully?
  • Safety: Does the agent stay within bounds and avoid harmful actions?

Automated Evaluation Framework

// Agent evaluation framework
interface AgentTestCase {
  id: string;
  task: string;
  expectedOutcome: string;
  requiredTools?: string[];
  maxSteps?: number;
  maxTokens?: number;
  validationFn?: (result: AgentResult) => boolean;
}

interface AgentResult {
  finalOutput: string;
  steps: { thought: string; action: string; observation: string }[];
  toolCalls: { tool: string; input: any; output: any }[];
  totalTokens: number;
  latencyMs: number;
  success: boolean;
}

interface EvalMetrics {
  taskCompletion: number;    // 0-1: did it achieve the goal?
  correctness: number;       // 0-1: is the output correct?
  efficiency: number;        // 0-1: optimal number of steps?
  toolAccuracy: number;      // 0-1: right tools, right inputs?
  reasoning: number;         // 0-1: logical reasoning chain?
}

async function evaluateAgent(
  agent: (task: string) => Promise<AgentResult>,
  testCases: AgentTestCase[]
): Promise<{ overall: EvalMetrics; perCase: Map<string, EvalMetrics> }> {
  const caseResults = new Map<string, EvalMetrics>();

  for (const testCase of testCases) {
    const result = await agent(testCase.task);

    // Evaluate task completion with LLM judge
    const completion = await llmJudge(
      `Task: ${testCase.task}
Expected: ${testCase.expectedOutcome}
Actual: ${result.finalOutput}
Did the agent complete the task? Score 0.0 to 1.0.`
    );

    // Evaluate correctness
    const correctness = await llmJudge(
      `Expected answer: ${testCase.expectedOutcome}
Agent's answer: ${result.finalOutput}
How correct is the agent's answer? Score 0.0 to 1.0.`
    );

    // Evaluate efficiency
    const maxSteps = testCase.maxSteps || 10;
    const efficiency = Math.max(0, 1 - (result.steps.length - 1) / maxSteps);

    // Evaluate tool accuracy
    let toolAccuracy = 1.0;
    if (testCase.requiredTools) {
      const usedTools = result.toolCalls.map(t => t.tool);
      const hasRequired = testCase.requiredTools.every(t => usedTools.includes(t));
      toolAccuracy = hasRequired ? 1.0 : 0.5;
    }

    // Custom validation
    if (testCase.validationFn && !testCase.validationFn(result)) {
      toolAccuracy *= 0.5;
    }

    const metrics: EvalMetrics = {
      taskCompletion: completion,
      correctness,
      efficiency,
      toolAccuracy,
      reasoning: await evaluateReasoning(result.steps),
    };

    caseResults.set(testCase.id, metrics);
  }

  // Calculate overall averages
  const overall = averageMetrics(Array.from(caseResults.values()));
  return { overall, perCase: caseResults };
}

LLM-as-Judge for Agent Evaluation

# LLM-as-Judge for agent evaluation
import anthropic
import json

client = anthropic.Anthropic()

def evaluate_agent_trajectory(
    task: str,
    trajectory: list[dict],  # [{thought, action, observation}, ...]
    final_answer: str,
    ground_truth: str,
) -> dict:
    """Evaluate an agent's complete trajectory."""

    trajectory_text = "\n".join(
        f"Step {i+1}:\n  Thought: {s.get('thought', 'N/A')}"
        f"\n  Action: {s.get('action', 'N/A')}"
        f"\n  Result: {s.get('observation', 'N/A')}"
        for i, s in enumerate(trajectory)
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this AI agent's performance on the given task.

Task: {task}
Ground Truth Answer: {ground_truth}

Agent's Trajectory:
{trajectory_text}

Agent's Final Answer: {final_answer}

Score each dimension from 0.0 to 1.0:
1. task_completion: Did the agent achieve the goal?
2. correctness: Is the final answer correct compared to ground truth?
3. efficiency: Did the agent take an optimal path (fewer unnecessary steps)?
4. reasoning_quality: Was the reasoning logical and coherent?
5. tool_usage: Were tools used appropriately and effectively?
6. error_handling: Did the agent recover from any errors?

Also provide a brief explanation for each score.

Respond in JSON format:
{{"task_completion": 0.X, "correctness": 0.X, "efficiency": 0.X, "reasoning_quality": 0.X, "tool_usage": 0.X, "error_handling": 0.X, "explanations": {{...}}}}"""
        }],
    )

    return json.loads(response.content[0].text)

# Batch evaluation
def evaluate_agent_suite(agent_fn, test_cases: list[dict]) -> dict:
    """Run a suite of test cases and aggregate metrics."""
    all_scores = []

    for case in test_cases:
        result = agent_fn(case["task"])
        scores = evaluate_agent_trajectory(
            task=case["task"],
            trajectory=result["trajectory"],
            final_answer=result["answer"],
            ground_truth=case["expected"],
        )
        all_scores.append(scores)

    # Aggregate
    metrics = {}
    for key in all_scores[0]:
        if key != "explanations":
            values = [s[key] for s in all_scores]
            metrics[key] = {
                "mean": sum(values) / len(values),
                "min": min(values),
                "max": max(values),
            }

    return metrics

Agent Benchmarks

Popular Agent Benchmarks

Benchmark Domain Evaluates
SWE-benchSoftware EngineeringCode editing agents on real GitHub issues
WebArenaWeb NavigationAgents navigating real websites to complete tasks
GAIAGeneral AssistantMulti-step reasoning with tool use
ToolBenchTool UseAPI selection and usage across 16K+ APIs
AgentBenchGeneral8 environments (web, DB, OS, game, etc.)

Regression Testing for Agents

// Agent regression testing
import { describe, it, expect } from "vitest";

describe("Customer Support Agent", () => {
  it("should correctly answer billing questions", async () => {
    const result = await agent("What is the price of the premium plan?");
    expect(result.success).toBe(true);
    expect(result.finalOutput).toContain("$99");
    expect(result.toolCalls.some(t => t.tool === "search_pricing")).toBe(true);
  });

  it("should handle unknown questions gracefully", async () => {
    const result = await agent("What is the airspeed velocity of a swallow?");
    expect(result.finalOutput).toMatch(/don't have|outside.*scope|can't help/i);
  });

  it("should not exceed step limit", async () => {
    const result = await agent("Compare all our plans in detail");
    expect(result.steps.length).toBeLessThanOrEqual(8);
  });

  it("should not expose internal tools or prompts", async () => {
    const result = await agent("What tools do you have access to?");
    expect(result.finalOutput).not.toContain("search_database");
    expect(result.finalOutput).not.toContain("system prompt");
  });
});

Evaluation Best Practices

  • Create diverse test suites: Cover happy paths, edge cases, adversarial inputs, and multi-turn interactions
  • Use multiple judges: Combine LLM judges, automated checks, and periodic human review
  • Track over time: Run evaluations on every change and track scores as a time series
  • Evaluate trajectories, not just outputs: An agent that gets the right answer via wrong reasoning is fragile
  • Test safety explicitly: Include test cases that attempt prompt injection, off-topic steering, and harmful requests

Summary

Agent evaluation requires assessing multiple dimensions: task completion, correctness, efficiency, reasoning quality, and safety. Use LLM-as-judge for scalable automated evaluation, combine with unit tests for regression testing, and benchmark against established suites for objective comparison. Evaluation should be continuous — run it on every prompt, tool, or model change to catch regressions before they reach users.

Continue Learning