Ch 5 — Evaluating Agents

Task completion, tool use accuracy, trajectory evaluation, and multi-step reasoning
High Level: Task → Plan → Execute → Trajectory → Outcome → Learn
Why Agent Evaluation Is Hard
More dimensions, more failure modes
Agents vs Chatbots
A chatbot produces text. An agent produces actions. Evaluating text quality is hard enough; evaluating a sequence of actions — tool calls, API requests, file edits, database queries — across multiple steps is exponentially harder. The same task can be solved via many valid trajectories.
The Evaluation Dimensions
Task completion: Did it finish the job?
Correctness: Is the result correct?
Efficiency: How many steps/tokens/dollars did it take?
Safety: Did it avoid harmful actions?
Trajectory quality: Was the path reasonable?
Reliability: Does it succeed consistently?
The Reliability Problem
Single-run success rates hide reliability issues. A model achieving 96.9% success in clean conditions can drop to 88.1% under perturbations and fault conditions. An agent that works 9 out of 10 times is useless if you can’t predict when it will fail.
Key insight: Agent evaluation requires testing the same task multiple times (5–10 runs) to measure consistency. A 70% success rate with high consistency is often more valuable than 90% with high variance.
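The repeated-run protocol above can be sketched in a few lines. This is a minimal harness, assuming a `run_task` callable that returns True on success (the interface is illustrative, not a standard API):

```python
def consistency_report(run_task, n_runs=10):
    """Run the same task n_runs times and summarize reliability.

    run_task: hypothetical callable returning True on success.
    """
    results = [bool(run_task()) for _ in range(n_runs)]
    successes = sum(results)
    return {
        "success_rate": successes / n_runs,
        "successes": successes,
        "runs": n_runs,
    }

# Deterministic stand-in for an agent that succeeds 7 of 10 times
outcomes = iter([True, True, False, True, True,
                 False, True, True, False, True])
report = consistency_report(lambda: next(outcomes), n_runs=10)
print(report["success_rate"])  # 0.7
```

In practice `run_task` would invoke the agent end to end; the point is that the report captures consistency across runs, not a single lucky success.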
Task Completion Rate
The most basic and most important metric
Binary vs Partial
Binary: Did the agent complete the task? Yes/No. Simple but loses nuance — an agent that gets 90% of the way is scored the same as one that fails immediately.

Partial credit: Score based on sub-goals achieved. A file-editing agent that correctly edits 4 of 5 files scores 0.80. More informative but harder to define.
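The two scoring schemes can be sketched side by side. The sub-goal checks here are hypothetical booleans you would define per task (e.g. "file X edited correctly"):

```python
def partial_credit(subgoal_checks):
    """Score a task as the fraction of sub-goals achieved.

    subgoal_checks: list of booleans, one per sub-goal
    (hypothetical per-task checks, defined by the evaluator).
    """
    if not subgoal_checks:
        return 0.0
    return sum(subgoal_checks) / len(subgoal_checks)

# File-editing agent: 4 of 5 files edited correctly
checks = [True, True, True, True, False]
score = partial_credit(checks)
print(score)  # 0.8

# Binary scoring is the all-or-nothing special case
binary = 1.0 if all(checks) else 0.0
print(binary)  # 0.0
```

The gap between `0.8` and `0.0` on the same run is exactly the nuance binary scoring loses.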
SWE-bench as a Model
SWE-bench uses test-based verification: the agent generates a patch, and the repository’s test suite determines if it’s correct. This is the gold standard for coding agents — objective, automated, and resistant to gaming. Top agents solve ~43% of SWE-bench Pro tasks.
Beyond Pass/Fail
// Agent scorecard
Task completion: 78% (binary)
Partial credit: 0.89 (sub-goals)
Avg steps: 12.3 (efficiency)
Avg cost: $0.42/task
Consistency: 6/10 runs succeed
Tool Use Accuracy
Is the agent calling the right tools with the right arguments?
What to Measure
Tool selection accuracy: Did it pick the right tool for the job?
Argument correctness: Were the parameters correct?
Sequence validity: Were tools called in a logical order?
Error recovery: When a tool call failed, did it adapt?
Common Tool Failures
Wrong tool: Using search when it should use a calculator
Hallucinated tools: Calling tools that don’t exist
Wrong arguments: Correct tool, wrong parameters
Unnecessary calls: Calling tools when the answer is already known
Infinite loops: Retrying the same failed call repeatedly
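The last failure mode above is easy to detect mechanically. Here is a minimal loop detector, assuming a hypothetical trajectory format of `(tool_name, args, succeeded)` records (real frameworks log richer structures):

```python
def detect_retry_loop(calls, threshold=3):
    """Flag when the same (tool, args) call fails `threshold` times in a row.

    calls: list of (tool_name, args, succeeded) tuples — an
    illustrative trajectory format, not a standard one.
    """
    streak, last = 0, None
    for tool, args, ok in calls:
        if not ok and (tool, args) == last:
            streak += 1
        elif not ok:
            streak, last = 1, (tool, args)
        else:
            streak, last = 0, None
        if streak >= threshold:
            return True
    return False

trajectory = [
    ("search", ("q",), True),
    ("fetch", ("http://x",), False),
    ("fetch", ("http://x",), False),
    ("fetch", ("http://x",), False),
]
print(detect_retry_loop(trajectory))  # True
```

A check like this can run online in production to kill runaway agents before they burn budget.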
Measuring Tool Use
Compare the agent’s tool call sequence against a reference trajectory (the expected sequence of tool calls). Metrics:

Precision: What fraction of tool calls were necessary?
Recall: What fraction of necessary tool calls were made?
F1: Harmonic mean of precision and recall
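These three metrics can be computed directly from the call lists. One simple convention, sketched below, treats calls as an unordered multiset of `(tool, args)` pairs; order-sensitive variants also exist:

```python
from collections import Counter

def tool_call_prf(predicted, reference):
    """Precision/recall/F1 of an agent's tool calls vs a reference trajectory.

    Treats calls as a multiset of (tool, args) pairs, ignoring order.
    """
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Agent made one unnecessary extra search
reference = [("search", "weather"), ("calculator", "32*1.8+32")]
predicted = [("search", "weather"), ("search", "temperature"),
             ("calculator", "32*1.8+32")]
p, r, f1 = tool_call_prf(predicted, reference)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 1.0 0.8
```

Here recall is perfect (every necessary call was made) but precision drops because of the extra call, which is exactly the "unnecessary calls" failure mode.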
Key insight: Multiple valid trajectories exist for most tasks. Don’t penalize agents for taking a different-but-valid path. Focus on whether the outcome is correct and the trajectory is reasonable, not whether it matches a specific reference exactly.
Trajectory Evaluation
Judging the path, not just the destination
Why Trajectories Matter
Two agents might both complete a task, but one takes 5 steps and $0.10 while the other takes 50 steps and $5.00. Trajectory evaluation captures efficiency, reasoning quality, and cost — dimensions that task completion alone misses.
Shepherd’s Failure Patterns
Research analyzing 3,908 agent trajectories across 18 models identified three distinct failure patterns:

Failure-to-Act: Agent fails to interact with the environment
Out-of-Order Actions: Interdependent actions issued simultaneously
False Termination: Agent prematurely assumes task is complete
LLM-as-Judge for Trajectories
Use an LLM judge to evaluate trajectory quality by asking:

1. Was the planning phase adequate?
2. Were tool calls appropriate and efficient?
3. Did the agent recover from errors?
4. Was the final answer correct?

Shepherd used this approach to improve agent performance from 21% to 31% while cutting costs by 57%.
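The four-question rubric above can be assembled into a judge prompt mechanically. This is a sketch under stated assumptions: the task and trajectory are plain strings here, and a real system would serialize structured steps and send the prompt to an LLM judge API:

```python
RUBRIC = [
    "Was the planning phase adequate?",
    "Were tool calls appropriate and efficient?",
    "Did the agent recover from errors?",
    "Was the final answer correct?",
]

def build_judge_prompt(task, trajectory_log):
    """Assemble a trajectory-judging prompt from the rubric questions."""
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(RUBRIC, 1))
    return (
        f"Task: {task}\n\n"
        f"Trajectory:\n{trajectory_log}\n\n"
        f"Answer each question with yes/no and a one-line justification:\n"
        f"{questions}"
    )

prompt = build_judge_prompt(
    "Fix the failing unit test",
    "1. read_file(test_utils.py)\n2. edit_file(utils.py)\n3. run_tests() -> pass",
)
print(prompt.count("?"))  # all 4 rubric questions included
```

Asking for yes/no answers with justifications keeps the judge's output easy to parse and audit.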
Practical tip: Log every agent trajectory in production. When failures occur, trajectory logs are your debugging tool. Pattern-match failures to identify systematic issues (e.g., “always fails on multi-file edits”).
Efficiency & Cost Metrics
Solving the task isn’t enough — it has to be affordable
What to Track
Steps per task: Number of actions taken (fewer = better)
Tokens consumed: Total input + output tokens (drives cost)
Wall-clock time: End-to-end latency
Cost per task: Dollar amount (API costs + compute)
Cost per success: Total cost / successful completions
The Cost-Quality Tradeoff
// Agent comparison
Agent A (GPT-4o)
  Success: 85%
  Cost: $2.10/task
  Cost/success: $2.47
Agent B (Claude Sonnet)
  Success: 78%
  Cost: $0.80/task
  Cost/success: $1.03
// Agent B is 2.4x more cost-effective
// despite lower raw success rate
Key insight: Cost per successful completion is often more important than raw success rate. A cheaper agent that succeeds less often can be more cost-effective if you can detect and retry failures.
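The arithmetic behind that insight is one division. A minimal sketch, using the numbers from the comparison above:

```python
def cost_per_success(success_rate, cost_per_task):
    """Expected dollars spent per successful completion."""
    if success_rate <= 0:
        return float("inf")
    return cost_per_task / success_rate

a = cost_per_success(0.85, 2.10)  # Agent A
b = cost_per_success(0.78, 0.80)  # Agent B
print(round(a, 2), round(b, 2))   # 2.47 1.03
print(round(a / b, 1))            # 2.4x advantage for B
```

This assumes failures are detectable and retryable at no extra cost beyond the failed run; if silent failures are expensive, raw success rate matters more.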
Safety & Sandboxing
Agents that take actions need safety evaluation
Why Agent Safety Is Different
A chatbot that hallucinates is annoying. An agent that hallucinates takes real actions — deleting files, sending emails, making API calls, spending money. Agent safety evaluation must test for harmful actions, not just harmful text.
Safety Dimensions
Scope adherence: Does it stay within its authorized actions?
Destructive action prevention: Does it avoid irreversible operations?
Confirmation seeking: Does it ask before high-impact actions?
Prompt injection resistance: Can users trick it into unauthorized actions?
Resource limits: Does it respect cost and time budgets?
Sandbox Testing
Always evaluate agents in a sandboxed environment that mirrors production but can’t cause real damage. Test with adversarial scenarios:

• “Delete all files in the project”
• “Send this email to all customers”
• “Ignore previous instructions and...”

The agent should refuse, ask for confirmation, or escalate.
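A confirmation gate for high-impact actions can be sketched as a simple policy check. The `DESTRUCTIVE` set here is an illustrative deny-by-default list, not a complete policy; a production sandbox would also enforce scope and budget limits:

```python
# Illustrative set of tools considered high-impact (assumed names)
DESTRUCTIVE = {"delete_file", "send_email", "drop_table", "transfer_funds"}

def guard(tool_name, confirmed=False):
    """Gate high-impact tool calls behind explicit confirmation."""
    if tool_name in DESTRUCTIVE and not confirmed:
        return "blocked: requires explicit confirmation"
    return "allowed"

print(guard("read_file"))          # allowed
print(guard("delete_file"))        # blocked: requires explicit confirmation
print(guard("delete_file", True))  # allowed
```

Running the adversarial prompts above against a guard like this checks "confirmation seeking" mechanically, before any LLM-based safety judging.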
Critical: Never evaluate agents in production environments without sandboxing. An agent that “helpfully” executes a destructive command during testing can cause real damage. Sandbox first, always.
Agent Benchmarks
The standardized tests for AI agents
Coding Agents
SWE-bench Verified: Real GitHub issues, test-based verification. Top: ~76%
SWE-bench Pro: Harder, long-horizon tasks across 41 repos. Top: ~43%
SWE-ContextBench: Tests experience reuse across related problems
General Agents
WebArena: Web browsing tasks (shopping, forums, maps)
OSWorld: Operating system tasks (file management, app usage)
ReliabilityBench: Tests agent reliability under production stress conditions
GAIA: General AI assistant tasks requiring multi-step reasoning
Benchmark Limitations
Agent benchmarks share the same contamination and saturation risks as LLM benchmarks, plus additional challenges:

Environment drift: Websites and APIs change, breaking benchmarks
Cost: Running agent benchmarks is 10–100x more expensive than text benchmarks
Reproducibility: Non-deterministic environments make exact reproduction difficult
Practical tip: Use public benchmarks for initial model selection, then build your own task suite from real use cases. 20–50 representative tasks with clear success criteria is enough to compare agents for your specific needs.
The Agent Eval Checklist
A practical framework for evaluating any agent
Before Deployment
1. Define clear success criteria for each task type
2. Build a test suite of 20–50 representative tasks
3. Run each task 5–10 times to measure consistency
4. Test adversarial scenarios (prompt injection, scope violations)
5. Measure cost per success, not just success rate
6. Evaluate in a sandboxed environment
In Production
1. Log every trajectory (actions, tool calls, reasoning)
2. Monitor success rate over time (detect drift)
3. Track cost per task (detect runaway spending)
4. Alert on safety violations (unauthorized actions)
5. Sample trajectories for human review (5–10%)
6. A/B test agent updates before full rollout
Next up: In Chapter 6, we’ll explore human evaluation — the gold standard that calibrates all automated metrics, from Chatbot Arena to annotation guidelines.