Ch 8 — Evaluation, Benchmarks & Metrics

Three-layer eval, benchmarks, cost-quality trade-offs, regression testing, and production monitoring
Modern MAS: Define → Measure → Compare → Analyze → Iterate → Ship
Why Evaluation Is Hard for MAS
More moving parts, more failure modes
The Problem
Single-model evaluation is already challenging; multi-agent systems add interaction effects, non-determinism from multiple LLM calls, and emergent behaviors that don’t appear in unit tests. A system can pass every agent-level test and still fail at the system level (deadlocks, redundant work, cost explosions). You need evaluation at three layers: individual agent capability, pairwise interaction quality, and end-to-end task success.
Pattern
Layer 1: agent capability
Layer 2: interaction quality
Layer 3: end-to-end task success
// All three must pass
Key insight: Agent-level tests passing does not guarantee system-level success — test the interactions.
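The three layers can be wired into one gate. A minimal sketch, assuming each layer is a list of zero-argument check functions (the check shapes here are hypothetical, not a prescribed API):

```python
# Minimal sketch of a three-layer eval gate: agent capability,
# interaction quality, and end-to-end task success must all pass.

def run_layer(checks):
    """A layer passes only if every check in it passes."""
    return all(check() for check in checks)

def evaluate(agent_checks, interaction_checks, e2e_checks):
    layers = {
        "agent_capability": run_layer(agent_checks),
        "interaction_quality": run_layer(interaction_checks),
        "end_to_end": run_layer(e2e_checks),
    }
    layers["ship"] = all(layers.values())
    return layers

# Agent-level tests can pass while the system-level layer fails:
ok = lambda: True
result = evaluate([ok, ok], [ok], [lambda: False])  # e2e check fails
```

Note how `result["ship"]` stays false even though every agent-level check passed, which is exactly the failure mode the slide warns about.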
Benchmarks for Multi-Agent Systems
Standardized tasks and environments
Landscape
Research benchmarks include collaborative coding tasks (SWE-bench with agent teams), debate and persuasion arenas, negotiation games, and multi-agent reasoning challenges. For production, build internal benchmarks from real task logs: sample completed tasks, replay them with the new system, and compare outcomes. Benchmarks should cover happy paths, edge cases (tool failures, ambiguous instructions), and adversarial inputs (prompt injection across agents).
Pattern
Research: SWE-bench, debate arenas
Internal: sampled task replays
Cover: happy + edge + adversarial
// Version your benchmark suite
Key insight: The most valuable benchmark is your own task log — it reflects real distribution, not synthetic puzzles.
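The replay loop above can be sketched in a few lines. The log-entry shape (`input`/`expected`) and the `new_system` callable are illustrative assumptions; a real harness would compare richer outcomes than string equality:

```python
# Sketch of an internal benchmark built from production task logs:
# sample completed tasks, replay them through the new system, and
# score the match rate against the logged outcomes.
import random

def sample_tasks(task_log, n, seed=0):
    """Seeded sampling so the benchmark suite is reproducible."""
    rng = random.Random(seed)
    return rng.sample(task_log, min(n, len(task_log)))

def replay_benchmark(task_log, new_system, n=100):
    tasks = sample_tasks(task_log, n)
    wins = sum(1 for t in tasks if new_system(t["input"]) == t["expected"])
    return wins / len(tasks)

# Toy example with str.upper standing in for the system under test;
# the last entry is deliberately wrong so the score is 2/3.
log = [{"input": "a", "expected": "A"},
       {"input": "b", "expected": "B"},
       {"input": "c", "expected": "X"}]
score = replay_benchmark(log, str.upper, n=3)
```

Seeding the sampler matters: it keeps the replay set stable across runs, so score changes reflect system changes, not sampling noise.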
Task Success vs Process Cost
Outcome quality and resource consumption
Trade-off
Task success rate alone is insufficient. A system that solves 95% of tasks but costs 10× more tokens, takes 5× longer, or requires 3× more human interventions may be worse than a simpler baseline. Track: success rate, total tokens (input + output across all agents), wall-clock latency, tool call count, human escalation rate, and dollar cost. Report these as a Pareto frontier: which configurations are undominated on both quality and cost?
Pattern
success_rate
total_tokens, latency_ms
tool_calls, human_escalations
cost_usd
// Pareto frontier across configs
Key insight: Always report cost alongside accuracy — a 2% accuracy gain at 5× cost is rarely worth it.
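Computing that frontier is mechanical. A sketch over two of the tracked dimensions (success rate up, dollar cost down); the config names and numbers are made up for illustration:

```python
# Sketch of Pareto-frontier filtering: keep only configurations that
# no other configuration beats on both success rate and cost.

def pareto_frontier(configs):
    def dominated(c, others):
        # c is dominated if some other config is at least as good on
        # success AND at least as cheap (assumes no duplicate configs).
        return any(o["success_rate"] >= c["success_rate"]
                   and o["cost_usd"] <= c["cost_usd"]
                   and o != c
                   for o in others)
    return [c for c in configs if not dominated(c, configs)]

configs = [
    {"name": "solo",  "success_rate": 0.90, "cost_usd": 0.10},
    {"name": "duo",   "success_rate": 0.95, "cost_usd": 0.50},
    {"name": "bloat", "success_rate": 0.92, "cost_usd": 0.80},  # dominated by duo
]
frontier = [c["name"] for c in pareto_frontier(configs)]
```

Here "bloat" drops out: "duo" is both more accurate and cheaper, so the 0.80-dollar configuration never belongs in the report.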
Multi-Turn & Conversation-Level Eval
Beyond single-response quality
Method
Multi-agent conversations span many turns. Evaluate: conversation coherence (do agents stay on topic?), information flow (does relevant info reach the right agent?), turn efficiency (how many turns to reach a decision?), and termination quality (did the conversation end at the right time?). Use LLM-as-judge on sampled transcripts with rubrics for each dimension. Compare against golden transcripts from expert-annotated examples.
Pattern
Coherence: on-topic?
Info flow: right agent got it?
Efficiency: turns to decision
Termination: timely end?
// LLM-as-judge + golden transcripts
Key insight: A conversation that reaches the right answer in 20 turns vs 5 has a quality problem, not just a cost problem.
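The rubric-per-dimension structure can be sketched as follows. `call_judge` is a hypothetical stand-in for a real LLM call; only the shape of the loop is the point here:

```python
# Sketch of LLM-as-judge scoring over the four conversation-level
# dimensions. call_judge(transcript, question) -> score is assumed.

RUBRIC = {
    "coherence":   "Do the agents stay on topic throughout?",
    "info_flow":   "Does relevant information reach the right agent?",
    "efficiency":  "How many turns were needed to reach a decision?",
    "termination": "Did the conversation end at the right time?",
}

def score_transcript(transcript, call_judge):
    """One judge call per rubric dimension, scored independently."""
    return {dim: call_judge(transcript, question)
            for dim, question in RUBRIC.items()}

# Toy judge that always returns 4, just to show the output shape:
scores = score_transcript("...", lambda transcript, question: 4)
```

In practice each dimension's score for a sampled transcript would be compared against the score the same judge gives the golden transcript, which controls for judge bias.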
Regression Testing for Agent Systems
Catching regressions before production
Practice
Every prompt change, model upgrade, or tool update can break multi-agent behavior. Build a regression suite: a set of deterministic-seed replays (where possible) or statistical tests (run N trials, compare distributions). Track behavioral fingerprints: tool call sequences, message counts, and decision patterns. Alert when fingerprints shift beyond a threshold. Store golden outputs and diff against them in CI.
Pattern
Replay: seed + inputs
Fingerprint: tool sequence + msg count
Alert: drift > threshold
// Run in CI on every PR
Key insight: Prompt changes are code changes — they deserve the same regression testing discipline.
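The fingerprint check can be sketched like this. The fingerprint shape (tool-call sequence plus message count) follows the slide; the 25% tolerance is an assumed threshold, not a recommendation:

```python
# Sketch of behavioral-fingerprint drift detection for CI: compare a
# candidate run against a stored golden fingerprint.

def fingerprint(run):
    """Reduce a run to its behavioral signature."""
    return {"tools": tuple(run["tool_calls"]), "msgs": run["message_count"]}

def drifted(golden, candidate, msg_tolerance=0.25):
    """Alert on any tool-sequence change, or a message-count shift
    beyond the tolerance (25% here, an illustrative default)."""
    if golden["tools"] != candidate["tools"]:
        return True
    return abs(candidate["msgs"] - golden["msgs"]) > msg_tolerance * golden["msgs"]

golden = fingerprint({"tool_calls": ["search", "write"], "message_count": 8})
new    = fingerprint({"tool_calls": ["search", "write"], "message_count": 12})
alert = drifted(golden, new)  # same tools, but 50% more messages
```

Exact-match on the tool sequence is deliberately strict; for noisier systems the statistical variant from the slide (run N trials, compare distributions) replaces the single-run comparison.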
Ablation & Agent Contribution Analysis
Is each agent earning its keep?
Method
Ablation studies: remove one agent at a time and measure the impact on task success and cost. If removing an agent has no effect, it’s dead weight. If removing it crashes the system, it’s a single point of failure that needs redundancy. Also measure marginal contribution: what does adding a third reviewer agent improve over two? Diminishing returns are common — more agents often mean more tokens for marginal gains.
Pattern
Remove agent → measure delta
No effect = remove it
System crash = add redundancy
// Diminishing returns are real
Key insight: If you cannot show an agent’s marginal contribution with data, you probably don’t need it.
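The remove-one-at-a-time loop is short enough to sketch directly. `run_benchmark` is a hypothetical evaluator that maps a team of agents to a success rate:

```python
# Sketch of an ablation loop: re-run the benchmark with each agent
# removed in turn; the drop from baseline is that agent's marginal
# contribution.

def ablation(agents, run_benchmark):
    baseline = run_benchmark(agents)
    deltas = {}
    for a in agents:
        reduced = [x for x in agents if x != a]
        deltas[a] = baseline - run_benchmark(reduced)
    return baseline, deltas

# Toy benchmark: success depends only on having a "solver" agent, so
# the "reviewer" shows up as dead weight.
bench = lambda team: 0.9 if "solver" in team else 0.2
base, deltas = ablation(["solver", "reviewer"], bench)
```

A delta near zero (the reviewer here) is the data-backed case for removing the agent; a large delta flags an agent whose failure would sink the system.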
Production Monitoring & Alerts
Keeping the system healthy after launch
Operations
Production monitoring extends evaluation into real time. Key signals: success rate (rolling window), p95 latency, token burn rate, error rate by agent, human escalation trend, and conversation length distribution. Set alerts for: success rate drop > 5%, cost spike > 2× baseline, any agent error rate > 10%, and conversation length > 2× median. Build runbooks for each alert with diagnosis steps and rollback procedures.
Pattern
success_rate < threshold → alert
cost > 2× baseline → alert
agent_errors > 10% → alert
// Runbook per alert
Key insight: Every alert needs a runbook — an alert without a response plan is just noise.
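The alert rules above translate directly into a check function. The thresholds follow the slide; the metrics-dict shape is an assumption for illustration, and each alert name would key into its runbook:

```python
# Sketch of the slide's alert rules as one check over current metrics
# vs. a rolling baseline.

def check_alerts(metrics, baseline):
    alerts = []
    if metrics["success_rate"] < baseline["success_rate"] - 0.05:
        alerts.append("success_rate_drop")              # drop > 5%
    if metrics["cost_usd"] > 2 * baseline["cost_usd"]:
        alerts.append("cost_spike")                     # > 2× baseline
    for agent, err in metrics["agent_error_rates"].items():
        if err > 0.10:
            alerts.append(f"agent_errors:{agent}")      # per-agent > 10%
    if metrics["conversation_len"] > 2 * baseline["median_conversation_len"]:
        alerts.append("long_conversations")             # > 2× median
    return alerts  # each entry maps to a runbook

baseline = {"success_rate": 0.92, "cost_usd": 1.0,
            "median_conversation_len": 10}
now = {"success_rate": 0.85, "cost_usd": 2.5,
       "agent_error_rates": {"planner": 0.02, "coder": 0.15},
       "conversation_len": 12}
fired = check_alerts(now, baseline)
```

With these toy numbers, the success-rate drop, the cost spike, and the coder agent's error rate all fire; conversation length (12 vs. a 20 ceiling) does not.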
Building Your Eval Stack
From benchmarks to production monitoring
Roadmap
Start with end-to-end task tests from real logs. Add agent-level unit tests for each role. Build regression suites in CI. Run ablation studies before adding agents. Deploy with production monitoring and runbooks. Review metrics weekly. Next chapter: safety, control, and failure modes — what happens when evaluation misses something and the system goes wrong.
Pattern
Task tests → Agent unit tests
Regression CI → Ablation
Prod monitoring → Runbooks
// Ch 9: safety & failure modes
Key insight: Evaluation is not a phase — it’s a continuous practice that runs from development through production.