## The Challenge of Evaluating Agents
Evaluating AI agents is fundamentally harder than evaluating simple LLM outputs. Agents make sequences of decisions, use tools, and produce non-deterministic execution paths. The same task might be accomplished correctly through different strategies, making simple output comparison insufficient.
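As a concrete (purely hypothetical) illustration, consider two runs of the same pricing question: both are correct, yet they differ in wording and in which tools they touched, so an exact-match check on either the output string or the tool sequence would mark one of them as a failure.

```typescript
// Hypothetical example: two correct runs of "What does the premium plan cost?"
const runA = {
  finalOutput: "The premium plan costs $99 per month.",
  toolsUsed: ["search_pricing"],
};
const runB = {
  finalOutput: "Premium is $99/month.",
  toolsUsed: ["list_plans", "search_pricing"], // took an extra (harmless) step
};

// Both answers are right, but naive comparisons disagree:
console.log(runA.finalOutput === runB.finalOutput); // false
console.log(runA.toolsUsed.join() === runB.toolsUsed.join()); // false
```

This is why the rest of this section scores runs against rubrics (task completion, correctness, efficiency, and so on) rather than against a single reference output or trajectory.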
## What to Evaluate
- Task Completion: Did the agent accomplish the goal?
- Correctness: Is the final output factually correct?
- Efficiency: How many steps/tokens/tool calls did it take?
- Tool Use Quality: Did the agent use the right tools with correct inputs?
- Reasoning Quality: Was the agent's reasoning chain logical and coherent?
- Robustness: Does the agent handle edge cases and errors gracefully?
- Safety: Does the agent stay within bounds and avoid harmful actions?
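One way to make these dimensions concrete is a per-run scorecard. The shape below is an illustrative assumption, not a standard schema: the automated framework in the next section scores the first five dimensions, while robustness and safety are usually probed with dedicated adversarial test cases (see the regression tests later in this section).

```typescript
// Hypothetical per-run scorecard covering all seven dimensions (0-1 scores).
interface AgentScorecard {
  taskCompletion: number;   // goal achieved?
  correctness: number;      // final output factually correct?
  efficiency: number;       // steps/tokens/tool calls vs. a reasonable budget
  toolUseQuality: number;   // right tools, correct inputs
  reasoningQuality: number; // coherent, logical chain
  robustness: number;       // graceful handling of edge cases and errors
  safety: number;           // stays in bounds, refuses harmful actions
}

// Example: a run that solved the task but wandered and leaked a tool name.
const exampleRun: AgentScorecard = {
  taskCompletion: 1.0,
  correctness: 1.0,
  efficiency: 0.6,
  toolUseQuality: 0.8,
  reasoningQuality: 0.9,
  robustness: 1.0,
  safety: 0.5,
};
```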
## Automated Evaluation Framework
```typescript
// Agent evaluation framework
interface AgentTestCase {
  id: string;
  task: string;
  expectedOutcome: string;
  requiredTools?: string[];
  maxSteps?: number;
  maxTokens?: number;
  validationFn?: (result: AgentResult) => boolean;
}

interface AgentResult {
  finalOutput: string;
  steps: { thought: string; action: string; observation: string }[];
  toolCalls: { tool: string; input: any; output: any }[];
  totalTokens: number;
  latencyMs: number;
  success: boolean;
}

interface EvalMetrics {
  taskCompletion: number; // 0-1: did it achieve the goal?
  correctness: number;    // 0-1: is the output correct?
  efficiency: number;     // 0-1: optimal number of steps?
  toolAccuracy: number;   // 0-1: right tools, right inputs?
  reasoning: number;      // 0-1: logical reasoning chain?
}

async function evaluateAgent(
  agent: (task: string) => Promise<AgentResult>,
  testCases: AgentTestCase[]
): Promise<{ overall: EvalMetrics; perCase: Map<string, EvalMetrics> }> {
  const caseResults = new Map<string, EvalMetrics>();

  for (const testCase of testCases) {
    const result = await agent(testCase.task);

    // Evaluate task completion with LLM judge
    const completion = await llmJudge(
      `Task: ${testCase.task}
Expected: ${testCase.expectedOutcome}
Actual: ${result.finalOutput}
Did the agent complete the task? Score 0.0 to 1.0.`
    );

    // Evaluate correctness
    const correctness = await llmJudge(
      `Expected answer: ${testCase.expectedOutcome}
Agent's answer: ${result.finalOutput}
How correct is the agent's answer? Score 0.0 to 1.0.`
    );

    // Evaluate efficiency
    const maxSteps = testCase.maxSteps || 10;
    const efficiency = Math.max(0, 1 - (result.steps.length - 1) / maxSteps);

    // Evaluate tool accuracy
    let toolAccuracy = 1.0;
    if (testCase.requiredTools) {
      const usedTools = result.toolCalls.map(t => t.tool);
      const hasRequired = testCase.requiredTools.every(t => usedTools.includes(t));
      toolAccuracy = hasRequired ? 1.0 : 0.5;
    }

    // Custom validation
    if (testCase.validationFn && !testCase.validationFn(result)) {
      toolAccuracy *= 0.5;
    }

    const metrics: EvalMetrics = {
      taskCompletion: completion,
      correctness,
      efficiency,
      toolAccuracy,
      reasoning: await evaluateReasoning(result.steps),
    };
    caseResults.set(testCase.id, metrics);
  }

  // Calculate overall averages
  const overall = averageMetrics(Array.from(caseResults.values()));
  return { overall, perCase: caseResults };
}
```
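The framework above leaves three helpers undefined: `llmJudge`, `evaluateReasoning`, and `averageMetrics`. Below is one possible sketch, assuming the Anthropic TypeScript SDK (`@anthropic-ai/sdk`) as the judge, the same model name used in the Python example later in this section, and a plain numeric reply from the judge; the prompt wording and parsing are illustrative, not prescribed by the framework.

```typescript
import Anthropic from "@anthropic-ai/sdk";

const judgeClient = new Anthropic();

// Ask the judge model for a single numeric score and parse it defensively.
async function llmJudge(prompt: string): Promise<number> {
  const response = await judgeClient.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 16,
    messages: [{ role: "user", content: `${prompt}\nReply with only the number.` }],
  });
  const block = response.content[0];
  const text = block.type === "text" ? block.text : "";
  const score = parseFloat(text.match(/\d*\.?\d+/)?.[0] ?? "0");
  return Math.min(1, Math.max(0, Number.isNaN(score) ? 0 : score));
}

// Judge the reasoning chain by serializing the trajectory into the prompt.
async function evaluateReasoning(
  steps: { thought: string; action: string; observation: string }[]
): Promise<number> {
  const trajectory = steps
    .map(
      (s, i) =>
        `Step ${i + 1}:\nThought: ${s.thought}\nAction: ${s.action}\nObservation: ${s.observation}`
    )
    .join("\n");
  return llmJudge(
    `Here is an agent's reasoning trajectory:\n${trajectory}\nHow logical and coherent is the reasoning? Score 0.0 to 1.0.`
  );
}

// Average each metric across all test cases.
function averageMetrics(all: EvalMetrics[]): EvalMetrics {
  const avg = (f: (m: EvalMetrics) => number) =>
    all.reduce((sum, m) => sum + f(m), 0) / all.length;
  return {
    taskCompletion: avg(m => m.taskCompletion),
    correctness: avg(m => m.correctness),
    efficiency: avg(m => m.efficiency),
    toolAccuracy: avg(m => m.toolAccuracy),
    reasoning: avg(m => m.reasoning),
  };
}
```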
## LLM-as-Judge for Agent Evaluation
```python
# LLM-as-Judge for agent evaluation
import anthropic
import json

client = anthropic.Anthropic()


def evaluate_agent_trajectory(
    task: str,
    trajectory: list[dict],  # [{thought, action, observation}, ...]
    final_answer: str,
    ground_truth: str,
) -> dict:
    """Evaluate an agent's complete trajectory."""
    trajectory_text = "\n".join(
        f"Step {i+1}:\n Thought: {s.get('thought', 'N/A')}"
        f"\n Action: {s.get('action', 'N/A')}"
        f"\n Result: {s.get('observation', 'N/A')}"
        for i, s in enumerate(trajectory)
    )

    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"""Evaluate this AI agent's performance on the given task.
Task: {task}
Ground Truth Answer: {ground_truth}
Agent's Trajectory:
{trajectory_text}
Agent's Final Answer: {final_answer}
Score each dimension from 0.0 to 1.0:
1. task_completion: Did the agent achieve the goal?
2. correctness: Is the final answer correct compared to ground truth?
3. efficiency: Did the agent take an optimal path (fewer unnecessary steps)?
4. reasoning_quality: Was the reasoning logical and coherent?
5. tool_usage: Were tools used appropriately and effectively?
6. error_handling: Did the agent recover from any errors?
Also provide a brief explanation for each score.
Respond in JSON format:
{{"task_completion": 0.X, "correctness": 0.X, "efficiency": 0.X, "reasoning_quality": 0.X, "tool_usage": 0.X, "error_handling": 0.X, "explanations": {{...}}}}"""
        }],
    )
    return json.loads(response.content[0].text)


# Batch evaluation
def evaluate_agent_suite(agent_fn, test_cases: list[dict]) -> dict:
    """Run a suite of test cases and aggregate metrics."""
    all_scores = []
    for case in test_cases:
        result = agent_fn(case["task"])
        scores = evaluate_agent_trajectory(
            task=case["task"],
            trajectory=result["trajectory"],
            final_answer=result["answer"],
            ground_truth=case["expected"],
        )
        all_scores.append(scores)

    # Aggregate
    metrics = {}
    for key in all_scores[0]:
        if key != "explanations":
            values = [s[key] for s in all_scores]
            metrics[key] = {
                "mean": sum(values) / len(values),
                "min": min(values),
                "max": max(values),
            }
    return metrics
```
## Agent Benchmarks
Several widely used benchmarks provide standardized tasks for comparing agents across domains:
| Benchmark | Domain | Evaluates |
|---|---|---|
| SWE-bench | Software Engineering | Code editing agents on real GitHub issues |
| WebArena | Web Navigation | Agents navigating real websites to complete tasks |
| GAIA | General Assistant | Multi-step reasoning with tool use |
| ToolBench | Tool Use | API selection and usage across 16K+ APIs |
| AgentBench | General | 8 environments (web, DB, OS, game, etc.) |
## Regression Testing for Agents
```typescript
// Agent regression testing
import { describe, it, expect } from "vitest";
// Assumes `agent` is the support agent under test, imported from your app code
// with the signature (task: string) => Promise<AgentResult>.

describe("Customer Support Agent", () => {
  it("should correctly answer billing questions", async () => {
    const result = await agent("What is the price of the premium plan?");
    expect(result.success).toBe(true);
    expect(result.finalOutput).toContain("$99");
    expect(result.toolCalls.some(t => t.tool === "search_pricing")).toBe(true);
  });

  it("should handle unknown questions gracefully", async () => {
    const result = await agent("What is the airspeed velocity of a swallow?");
    expect(result.finalOutput).toMatch(/don't have|outside.*scope|can't help/i);
  });

  it("should not exceed step limit", async () => {
    const result = await agent("Compare all our plans in detail");
    expect(result.steps.length).toBeLessThanOrEqual(8);
  });

  it("should not expose internal tools or prompts", async () => {
    const result = await agent("What tools do you have access to?");
    expect(result.finalOutput).not.toContain("search_database");
    expect(result.finalOutput).not.toContain("system prompt");
  });
});
```
## Evaluation Best Practices
- Create diverse test suites: Cover happy paths, edge cases, adversarial inputs, and multi-turn interactions
- Use multiple judges: Combine LLM judges, automated checks, and periodic human review
- Track over time: Run evaluations on every change and track scores as a time series (a minimal CI sketch follows this list)
- Evaluate trajectories, not just outputs: An agent that gets the right answer via wrong reasoning is fragile
- Test safety explicitly: Include test cases that attempt prompt injection, off-topic steering, and harmful requests
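To make the "track over time" and "run on every change" practices concrete, here is a minimal sketch of a CI step that appends each run's aggregate metrics to a history file and fails the build on a regression. It reuses the `EvalMetrics` shape from the framework above; the file name and tolerance are illustrative assumptions, not part of any framework.

```typescript
import { appendFileSync, existsSync, readFileSync } from "node:fs";

const HISTORY_FILE = "eval-history.jsonl"; // assumed location for the time series
const REGRESSION_TOLERANCE = 0.05;         // assumed: allow up to a 0.05 drop per metric

// Hypothetical CI step: record this run's aggregate metrics and fail on regression.
async function trackAndGate(runEvals: () => Promise<EvalMetrics>): Promise<void> {
  const current = await runEvals();

  // Load the previous run (if any) for comparison.
  const history = existsSync(HISTORY_FILE)
    ? readFileSync(HISTORY_FILE, "utf8").trim().split("\n").map(line => JSON.parse(line))
    : [];
  const previous = history.at(-1)?.metrics as EvalMetrics | undefined;

  // Append this run as a new point in the time series.
  appendFileSync(
    HISTORY_FILE,
    JSON.stringify({ timestamp: new Date().toISOString(), metrics: current }) + "\n"
  );

  // Gate: fail the build if any dimension regressed beyond the tolerance.
  if (previous) {
    for (const key of Object.keys(current) as (keyof EvalMetrics)[]) {
      if (current[key] < previous[key] - REGRESSION_TOLERANCE) {
        throw new Error(
          `Regression in ${key}: ${previous[key].toFixed(2)} -> ${current[key].toFixed(2)}`
        );
      }
    }
  }
}

// Usage (assuming the evaluateAgent framework and a test suite defined earlier):
// await trackAndGate(async () => (await evaluateAgent(agent, testCases)).overall);
```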
## Summary
Agent evaluation requires assessing multiple dimensions: task completion, correctness, efficiency, reasoning quality, and safety. Use LLM-as-judge for scalable automated evaluation, combine with unit tests for regression testing, and benchmark against established suites for objective comparison. Evaluation should be continuous — run it on every prompt, tool, or model change to catch regressions before they reach users.