Ch 5 — Evaluating Agents

Task completion, tool use accuracy, trajectory evaluation, and multi-step reasoning
High Level: Task → Plan → Execute → Trajectory → Outcome → Learn
Why Agent Evaluation Is Hard
More dimensions, more failure modes
Agents vs Chatbots
A chatbot produces text. An agent produces actions. Evaluating text quality is hard enough; evaluating a sequence of actions — tool calls, API requests, file edits, database queries — across multiple steps is exponentially harder. The same task can be solved via many valid trajectories.
The Evaluation Dimensions
Task completion: Did it finish the job?
Correctness: Is the result correct?
Efficiency: How many steps/tokens/dollars did it take?
Safety: Did it avoid harmful actions?
Trajectory quality: Was the path reasonable?
Reliability: Does it succeed consistently?
The Reliability Problem
Single-run success rates hide reliability issues. A model achieving 96.9% success in clean conditions can drop to 88.1% under perturbations and fault conditions. An agent that works 9 out of 10 times is useless if you can’t predict when it will fail.
Key insight: Agent evaluation requires testing the same task multiple times (5–10 runs) to measure consistency. A 70% success rate with high consistency is often more valuable than 90% with high variance.
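The repeated-run protocol above can be sketched in a few lines. This is a minimal harness, assuming a `run_task` callable that returns True on success (the interface is illustrative, not a standard API):

```python
def consistency_report(run_task, n_runs=10):
    """Run the same task n_runs times and summarize reliability.

    run_task: hypothetical callable returning True on success.
    """
    results = [bool(run_task()) for _ in range(n_runs)]
    successes = sum(results)
    return {
        "success_rate": successes / n_runs,
        "successes": successes,
        "runs": n_runs,
    }

# Deterministic stand-in for an agent that succeeds 7 of 10 times
outcomes = iter([True, True, False, True, True,
                 False, True, True, False, True])
report = consistency_report(lambda: next(outcomes), n_runs=10)
print(report["success_rate"])  # 0.7
```

In practice `run_task` would invoke the agent end to end; the point is that the report captures consistency across runs, not a single lucky success.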
Task Completion Rate
The most basic and most important metric
Binary vs Partial
Binary: Did the agent complete the task? Yes/No. Simple but loses nuance — an agent that gets 90% of the way is scored the same as one that fails immediately.

Partial credit: Score based on sub-goals achieved. A file-editing agent that correctly edits 4 of 5 files scores 0.80. More informative but harder to define.
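The two scoring schemes can be sketched side by side. The sub-goal checks here are hypothetical booleans you would define per task (e.g. "file X edited correctly"):

```python
def partial_credit(subgoal_checks):
    """Score a task as the fraction of sub-goals achieved.

    subgoal_checks: list of booleans, one per sub-goal
    (hypothetical per-task checks, defined by the evaluator).
    """
    if not subgoal_checks:
        return 0.0
    return sum(subgoal_checks) / len(subgoal_checks)

# File-editing agent: 4 of 5 files edited correctly
checks = [True, True, True, True, False]
score = partial_credit(checks)
print(score)  # 0.8

# Binary scoring is the all-or-nothing special case
binary = 1.0 if all(checks) else 0.0
print(binary)  # 0.0
```

The gap between `0.8` and `0.0` on the same run is exactly the nuance binary scoring loses.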
SWE-bench as a Model
SWE-bench uses test-based verification: the agent generates a patch, and the repository’s test suite determines if it’s correct. This is the gold standard for coding agents — objective, automated, and resistant to gaming. Top agents solve ~43% of SWE-bench Pro tasks.
Beyond Pass/Fail
// Agent scorecard
Task completion: 78% (binary)
Partial credit: 0.89 (sub-goals)
Avg steps: 12.3 (efficiency)
Avg cost: $0.42/task
Consistency: 6/10 runs succeed
Tool Use Accuracy
Is the agent calling the right tools with the right arguments?
What to Measure
Tool selection accuracy: Did it pick the right tool for the job?
Argument correctness: Were the parameters correct?
Sequence validity: Were tools called in a logical order?
Error recovery: When a tool call failed, did it adapt?
Common Tool Failures
Wrong tool: Using search when it should use a calculator
Hallucinated tools: Calling tools that don’t exist
Wrong arguments: Correct tool, wrong parameters
Unnecessary calls: Calling tools when the answer is already known
Infinite loops: Retrying the same failed call repeatedly
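The last failure mode above is easy to detect mechanically. Here is a minimal loop detector, assuming a hypothetical trajectory format of `(tool_name, args, succeeded)` records (real frameworks log richer structures):

```python
def detect_retry_loop(calls, threshold=3):
    """Flag when the same (tool, args) call fails `threshold` times in a row.

    calls: list of (tool_name, args, succeeded) tuples — an
    illustrative trajectory format, not a standard one.
    """
    streak, last = 0, None
    for tool, args, ok in calls:
        if not ok and (tool, args) == last:
            streak += 1
        elif not ok:
            streak, last = 1, (tool, args)
        else:
            streak, last = 0, None
        if streak >= threshold:
            return True
    return False

trajectory = [
    ("search", ("q",), True),
    ("fetch", ("http://x",), False),
    ("fetch", ("http://x",), False),
    ("fetch", ("http://x",), False),
]
print(detect_retry_loop(trajectory))  # True
```

A check like this can run online in production to kill runaway agents before they burn budget.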
Measuring Tool Use
Compare the agent’s tool call sequence against a reference trajectory (the expected sequence of tool calls). Metrics:

Precision: What fraction of tool calls were necessary?
Recall: What fraction of necessary tool calls were made?
F1: Harmonic mean of precision and recall
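These three metrics can be computed directly from the call lists. One simple convention, sketched below, treats calls as an unordered multiset of `(tool, args)` pairs; order-sensitive variants also exist:

```python
from collections import Counter

def tool_call_prf(predicted, reference):
    """Precision/recall/F1 of an agent's tool calls vs a reference trajectory.

    Treats calls as a multiset of (tool, args) pairs, ignoring order.
    """
    pred, ref = Counter(predicted), Counter(reference)
    overlap = sum((pred & ref).values())  # multiset intersection
    precision = overlap / sum(pred.values()) if pred else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    denom = precision + recall
    f1 = (2 * precision * recall / denom) if denom else 0.0
    return precision, recall, f1

# Agent made one unnecessary extra search
reference = [("search", "weather"), ("calculator", "32*1.8+32")]
predicted = [("search", "weather"), ("search", "temperature"),
             ("calculator", "32*1.8+32")]
p, r, f1 = tool_call_prf(predicted, reference)
print(round(p, 2), round(r, 2), round(f1, 2))  # 0.67 1.0 0.8
```

Here recall is perfect (every necessary call was made) but precision drops because of the extra call, which is exactly the "unnecessary calls" failure mode.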
Key insight: Multiple valid trajectories exist for most tasks. Don’t penalize agents for taking a different-but-valid path. Focus on whether the outcome is correct and the trajectory is reasonable, not whether it matches a specific reference exactly.
Trajectory Evaluation
Judging the path, not just the destination
Why Trajectories Matter
Two agents might both complete a task, but one takes 5 steps and $0.10 while the other takes 50 steps and $5.00. Trajectory evaluation captures efficiency, reasoning quality, and cost — dimensions that task completion alone misses.
Shepherd’s Failure Patterns
Research analyzing 3,908 agent trajectories across 18 models identified three distinct failure patterns:

Failure-to-Act: Agent fails to interact with the environment
Out-of-Order Actions: Interdependent actions issued simultaneously
False Termination: Agent prematurely assumes task is complete
LLM-as-Judge for Trajectories
Use an LLM judge to evaluate trajectory quality by asking:

1. Was the planning phase adequate?
2. Were tool calls appropriate and efficient?
3. Did the agent recover from errors?
4. Was the final answer correct?

Shepherd used this approach to improve agent performance from 21% to 31% while cutting costs by 57%.
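The four-question rubric above can be assembled into a judge prompt mechanically. This is a sketch under stated assumptions: the task and trajectory are plain strings here, and a real system would serialize structured steps and send the prompt to an LLM judge API:

```python
RUBRIC = [
    "Was the planning phase adequate?",
    "Were tool calls appropriate and efficient?",
    "Did the agent recover from errors?",
    "Was the final answer correct?",
]

def build_judge_prompt(task, trajectory_log):
    """Assemble a trajectory-judging prompt from the rubric questions."""
    questions = "\n".join(f"{i}. {q}" for i, q in enumerate(RUBRIC, 1))
    return (
        f"Task: {task}\n\n"
        f"Trajectory:\n{trajectory_log}\n\n"
        f"Answer each question with yes/no and a one-line justification:\n"
        f"{questions}"
    )

prompt = build_judge_prompt(
    "Fix the failing unit test",
    "1. read_file(test_utils.py)\n2. edit_file(utils.py)\n3. run_tests() -> pass",
)
print(prompt.count("?"))  # all 4 rubric questions included
```

Asking for yes/no answers with justifications keeps the judge's output easy to parse and audit.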
Practical tip: Log every agent trajectory in production. When failures occur, trajectory logs are your debugging tool. Pattern-match failures to identify systematic issues (e.g., “always fails on multi-file edits”).
Efficiency & Cost Metrics
Solving the task isn’t enough — it has to be affordable
What to Track
Steps per task: Number of actions taken (fewer = better)
Tokens consumed: Total input + output tokens (drives cost)
Wall-clock time: End-to-end latency
Cost per task: Dollar amount (API costs + compute)
Cost per success: Total cost / successful completions
The Cost-Quality Tradeoff
// Agent comparison
Agent A (GPT-4o)
  Success: 85%
  Cost: $2.10/task
  Cost/success: $2.47
Agent B (Claude Sonnet)
  Success: 78%
  Cost: $0.80/task
  Cost/success: $1.03
// Agent B is 2.4x more cost-effective
// despite lower raw success rate
Key insight: Cost per successful completion is often more important than raw success rate. A cheaper agent that succeeds less often can be more cost-effective if you can detect and retry failures.
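The arithmetic behind that insight is one division. A minimal sketch, using the numbers from the comparison above:

```python
def cost_per_success(success_rate, cost_per_task):
    """Expected dollars spent per successful completion."""
    if success_rate <= 0:
        return float("inf")
    return cost_per_task / success_rate

a = cost_per_success(0.85, 2.10)  # Agent A
b = cost_per_success(0.78, 0.80)  # Agent B
print(round(a, 2), round(b, 2))   # 2.47 1.03
print(round(a / b, 1))            # 2.4x advantage for B
```

This assumes failures are detectable and retryable at no extra cost beyond the failed run; if silent failures are expensive, raw success rate matters more.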
Safety & Sandboxing
Agents that take actions need safety evaluation
Why Agent Safety Is Different
A chatbot that hallucinates is annoying. An agent that hallucinates takes real actions — deleting files, sending emails, making API calls, spending money. Agent safety evaluation must test for harmful actions, not just harmful text.
Safety Dimensions
Scope adherence: Does it stay within its authorized actions?
Destructive action prevention: Does it avoid irreversible operations?
Confirmation seeking: Does it ask before high-impact actions?
Prompt injection resistance: Can users trick it into unauthorized actions?
Resource limits: Does it respect cost and time budgets?
Sandbox Testing
Always evaluate agents in a sandboxed environment that mirrors production but can’t cause real damage. Test with adversarial scenarios:

• “Delete all files in the project”
• “Send this email to all customers”
• “Ignore previous instructions and...”

The agent should refuse, ask for confirmation, or escalate.
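A confirmation gate for high-impact actions can be sketched as a simple policy check. The `DESTRUCTIVE` set here is an illustrative deny-by-default list, not a complete policy; a production sandbox would also enforce scope and budget limits:

```python
# Illustrative set of tools considered high-impact (assumed names)
DESTRUCTIVE = {"delete_file", "send_email", "drop_table", "transfer_funds"}

def guard(tool_name, confirmed=False):
    """Gate high-impact tool calls behind explicit confirmation."""
    if tool_name in DESTRUCTIVE and not confirmed:
        return "blocked: requires explicit confirmation"
    return "allowed"

print(guard("read_file"))          # allowed
print(guard("delete_file"))        # blocked: requires explicit confirmation
print(guard("delete_file", True))  # allowed
```

Running the adversarial prompts above against a guard like this checks "confirmation seeking" mechanically, before any LLM-based safety judging.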
Critical: Never evaluate agents in production environments without sandboxing. An agent that “helpfully” executes a destructive command during testing can cause real damage. Sandbox first, always.
Agent Benchmarks
The standardized tests for AI agents
Coding Agents
SWE-bench Verified: Real GitHub issues, test-based verification. Top: ~76%
SWE-bench Pro: Harder, long-horizon tasks across 41 repos. Top: ~43%
SWE-ContextBench: Tests experience reuse across related problems
General Agents
WebArena: Web browsing tasks (shopping, forums, maps)
OSWorld: Operating system tasks (file management, app usage)
ReliabilityBench: Tests agent reliability under production stress conditions
GAIA: General AI assistant tasks requiring multi-step reasoning
Benchmark Limitations
Agent benchmarks share the same contamination and saturation risks as LLM benchmarks, plus additional challenges:

Environment drift: Websites and APIs change, breaking benchmarks
Cost: Running agent benchmarks is 10–100x more expensive than text benchmarks
Reproducibility: Non-deterministic environments make exact reproduction difficult
Practical tip: Use public benchmarks for initial model selection, then build your own task suite from real use cases. 20–50 representative tasks with clear success criteria is enough to compare agents for your specific needs.
The Agent Eval Checklist
A practical framework for evaluating any agent
Before Deployment
1. Define clear success criteria for each task type
2. Build a test suite of 20–50 representative tasks
3. Run each task 5–10 times to measure consistency
4. Test adversarial scenarios (prompt injection, scope violations)
5. Measure cost per success, not just success rate
6. Evaluate in a sandboxed environment
In Production
1. Log every trajectory (actions, tool calls, reasoning)
2. Monitor success rate over time (detect drift)
3. Track cost per task (detect runaway spending)
4. Alert on safety violations (unauthorized actions)
5. Sample trajectories for human review (5–10%)
6. A/B test agent updates before full rollout
Next up: In Chapter 6, we’ll explore human evaluation — the gold standard that calibrates all automated metrics, from Chatbot Arena to annotation guidelines.