Ch 8 — Evaluation, Benchmarks & Metrics

Three-layer eval, benchmarks, cost-quality trade-offs, regression testing, and production monitoring
Modern MAS: Define → Measure → Compare → Analyze → Iterate → Ship
Why Evaluation Is Hard for MAS
More moving parts, more failure modes
The Problem
Single-model evaluation is already challenging; multi-agent systems add interaction effects, non-determinism from multiple LLM calls, and emergent behaviors that don’t appear in unit tests. A system can pass every agent-level test and still fail at the system level (deadlocks, redundant work, cost explosions). You need evaluation at three layers: individual agent capability, pairwise interaction quality, and end-to-end task success.
Pattern
Layer 1: agent capability
Layer 2: interaction quality
Layer 3: end-to-end task success
// All three must pass
Key insight: Agent-level tests passing does not guarantee system-level success — test the interactions.
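The three layers can be wired into one gate. A minimal sketch, assuming each layer is a list of zero-argument check functions (the check shapes here are hypothetical, not a prescribed API):

```python
# Minimal sketch of a three-layer eval gate: agent capability,
# interaction quality, and end-to-end task success must all pass.

def run_layer(checks):
    """A layer passes only if every check in it passes."""
    return all(check() for check in checks)

def evaluate(agent_checks, interaction_checks, e2e_checks):
    layers = {
        "agent_capability": run_layer(agent_checks),
        "interaction_quality": run_layer(interaction_checks),
        "end_to_end": run_layer(e2e_checks),
    }
    layers["ship"] = all(layers.values())
    return layers

# Agent-level tests can pass while the system-level layer fails:
ok = lambda: True
result = evaluate([ok, ok], [ok], [lambda: False])  # e2e check fails
```

Note how `result["ship"]` stays false even though every agent-level check passed, which is exactly the failure mode the slide warns about.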
Benchmarks for Multi-Agent Systems
Standardized tasks and environments
Landscape
Research benchmarks include collaborative coding tasks (SWE-bench with agent teams), debate and persuasion arenas, negotiation games, and multi-agent reasoning challenges. For production, build internal benchmarks from real task logs: sample completed tasks, replay them with the new system, and compare outcomes. Benchmarks should cover happy paths, edge cases (tool failures, ambiguous instructions), and adversarial inputs (prompt injection across agents).
Pattern
Research: SWE-bench, debate arenas
Internal: sampled task replays
Cover: happy + edge + adversarial
// Version your benchmark suite
Key insight: The most valuable benchmark is your own task log — it reflects real distribution, not synthetic puzzles.
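The replay loop above can be sketched in a few lines. The log-entry shape (`input`/`expected`) and the `new_system` callable are illustrative assumptions; a real harness would compare richer outcomes than string equality:

```python
# Sketch of an internal benchmark built from production task logs:
# sample completed tasks, replay them through the new system, and
# score the match rate against the logged outcomes.
import random

def sample_tasks(task_log, n, seed=0):
    """Seeded sampling so the benchmark suite is reproducible."""
    rng = random.Random(seed)
    return rng.sample(task_log, min(n, len(task_log)))

def replay_benchmark(task_log, new_system, n=100):
    tasks = sample_tasks(task_log, n)
    wins = sum(1 for t in tasks if new_system(t["input"]) == t["expected"])
    return wins / len(tasks)

# Toy example with str.upper standing in for the system under test;
# the last entry is deliberately wrong so the score is 2/3.
log = [{"input": "a", "expected": "A"},
       {"input": "b", "expected": "B"},
       {"input": "c", "expected": "X"}]
score = replay_benchmark(log, str.upper, n=3)
```

Seeding the sampler matters: it keeps the replay set stable across runs, so score changes reflect system changes, not sampling noise.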
Task Success vs Process Cost
Outcome quality and resource consumption
Trade-off
Task success rate alone is insufficient. A system that solves 95% of tasks but costs 10× more tokens, takes 5× longer, or requires 3× more human interventions may be worse than a simpler baseline. Track: success rate, total tokens (input + output across all agents), wall-clock latency, tool call count, human escalation rate, and dollar cost. Report these as a Pareto frontier: which configurations are undominated on both quality and cost?
Pattern
success_rate
total_tokens, latency_ms
tool_calls, human_escalations
cost_usd
// Pareto frontier across configs
Key insight: Always report cost alongside accuracy — a 2% accuracy gain at 5× cost is rarely worth it.
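Computing that frontier is mechanical. A sketch over two of the tracked dimensions (success rate up, dollar cost down); the config names and numbers are made up for illustration:

```python
# Sketch of Pareto-frontier filtering: keep only configurations that
# no other configuration beats on both success rate and cost.

def pareto_frontier(configs):
    def dominated(c, others):
        # c is dominated if some other config is at least as good on
        # success AND at least as cheap (assumes no duplicate configs).
        return any(o["success_rate"] >= c["success_rate"]
                   and o["cost_usd"] <= c["cost_usd"]
                   and o != c
                   for o in others)
    return [c for c in configs if not dominated(c, configs)]

configs = [
    {"name": "solo",  "success_rate": 0.90, "cost_usd": 0.10},
    {"name": "duo",   "success_rate": 0.95, "cost_usd": 0.50},
    {"name": "bloat", "success_rate": 0.92, "cost_usd": 0.80},  # dominated by duo
]
frontier = [c["name"] for c in pareto_frontier(configs)]
```

Here "bloat" drops out: "duo" is both more accurate and cheaper, so the 0.80-dollar configuration never belongs in the report.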
Multi-Turn & Conversation-Level Eval
Beyond single-response quality
Method
Multi-agent conversations span many turns. Evaluate: conversation coherence (do agents stay on topic?), information flow (does relevant info reach the right agent?), turn efficiency (how many turns to reach a decision?), and termination quality (did the conversation end at the right time?). Use LLM-as-judge on sampled transcripts with rubrics for each dimension. Compare against golden transcripts from expert-annotated examples.
Pattern
Coherence: on-topic?
Info flow: right agent got it?
Efficiency: turns to decision
Termination: timely end?
// LLM-as-judge + golden transcripts
Key insight: A conversation that reaches the right answer in 20 turns vs 5 has a quality problem, not just a cost problem.
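The rubric-per-dimension structure can be sketched as follows. `call_judge` is a hypothetical stand-in for a real LLM call; only the shape of the loop is the point here:

```python
# Sketch of LLM-as-judge scoring over the four conversation-level
# dimensions. call_judge(transcript, question) -> score is assumed.

RUBRIC = {
    "coherence":   "Do the agents stay on topic throughout?",
    "info_flow":   "Does relevant information reach the right agent?",
    "efficiency":  "How many turns were needed to reach a decision?",
    "termination": "Did the conversation end at the right time?",
}

def score_transcript(transcript, call_judge):
    """One judge call per rubric dimension, scored independently."""
    return {dim: call_judge(transcript, question)
            for dim, question in RUBRIC.items()}

# Toy judge that always returns 4, just to show the output shape:
scores = score_transcript("...", lambda transcript, question: 4)
```

In practice each dimension's score for a sampled transcript would be compared against the score the same judge gives the golden transcript, which controls for judge bias.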
Regression Testing for Agent Systems
Catching regressions before production
Practice
Every prompt change, model upgrade, or tool update can break multi-agent behavior. Build a regression suite: a set of deterministic-seed replays (where possible) or statistical tests (run N trials, compare distributions). Track behavioral fingerprints: tool call sequences, message counts, and decision patterns. Alert when fingerprints shift beyond a threshold. Store golden outputs and diff against them in CI.
Pattern
Replay: seed + inputs
Fingerprint: tool sequence + msg count
Alert: drift > threshold
// Run in CI on every PR
Key insight: Prompt changes are code changes — they deserve the same regression testing discipline.
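The fingerprint check can be sketched like this. The fingerprint shape (tool-call sequence plus message count) follows the slide; the 25% tolerance is an assumed threshold, not a recommendation:

```python
# Sketch of behavioral-fingerprint drift detection for CI: compare a
# candidate run against a stored golden fingerprint.

def fingerprint(run):
    """Reduce a run to its behavioral signature."""
    return {"tools": tuple(run["tool_calls"]), "msgs": run["message_count"]}

def drifted(golden, candidate, msg_tolerance=0.25):
    """Alert on any tool-sequence change, or a message-count shift
    beyond the tolerance (25% here, an illustrative default)."""
    if golden["tools"] != candidate["tools"]:
        return True
    return abs(candidate["msgs"] - golden["msgs"]) > msg_tolerance * golden["msgs"]

golden = fingerprint({"tool_calls": ["search", "write"], "message_count": 8})
new    = fingerprint({"tool_calls": ["search", "write"], "message_count": 12})
alert = drifted(golden, new)  # same tools, but 50% more messages
```

Exact-match on the tool sequence is deliberately strict; for noisier systems the statistical variant from the slide (run N trials, compare distributions) replaces the single-run comparison.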
Ablation & Agent Contribution Analysis
Is each agent earning its keep?
Method
Ablation studies: remove one agent at a time and measure the impact on task success and cost. If removing an agent has no effect, it’s dead weight. If removing it crashes the system, it’s a single point of failure that needs redundancy. Also measure marginal contribution: what does adding a third reviewer agent improve over two? Diminishing returns are common — more agents often mean more tokens for marginal gains.
Pattern
Remove agent → measure delta
No effect = remove it
System crash = add redundancy
// Diminishing returns are real
Key insight: If you cannot show an agent’s marginal contribution with data, you probably don’t need it.
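The remove-one-at-a-time loop is short enough to sketch directly. `run_benchmark` is a hypothetical evaluator that maps a team of agents to a success rate:

```python
# Sketch of an ablation loop: re-run the benchmark with each agent
# removed in turn; the drop from baseline is that agent's marginal
# contribution.

def ablation(agents, run_benchmark):
    baseline = run_benchmark(agents)
    deltas = {}
    for a in agents:
        reduced = [x for x in agents if x != a]
        deltas[a] = baseline - run_benchmark(reduced)
    return baseline, deltas

# Toy benchmark: success depends only on having a "solver" agent, so
# the "reviewer" shows up as dead weight.
bench = lambda team: 0.9 if "solver" in team else 0.2
base, deltas = ablation(["solver", "reviewer"], bench)
```

A delta near zero (the reviewer here) is the data-backed case for removing the agent; a large delta flags an agent whose failure would sink the system.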
Production Monitoring & Alerts
Keeping the system healthy after launch
Operations
Production monitoring extends evaluation into real time. Key signals: success rate (rolling window), p95 latency, token burn rate, error rate by agent, human escalation trend, and conversation length distribution. Set alerts for: success rate drop > 5%, cost spike > 2× baseline, any agent error rate > 10%, and conversation length > 2× median. Build runbooks for each alert with diagnosis steps and rollback procedures.
Pattern
success_rate < threshold → alert
cost > 2× baseline → alert
agent_errors > 10% → alert
// Runbook per alert
Key insight: Every alert needs a runbook — an alert without a response plan is just noise.
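The alert rules above translate directly into a check function. The thresholds follow the slide; the metrics-dict shape is an assumption for illustration, and each alert name would key into its runbook:

```python
# Sketch of the slide's alert rules as one check over current metrics
# vs. a rolling baseline.

def check_alerts(metrics, baseline):
    alerts = []
    if metrics["success_rate"] < baseline["success_rate"] - 0.05:
        alerts.append("success_rate_drop")              # drop > 5%
    if metrics["cost_usd"] > 2 * baseline["cost_usd"]:
        alerts.append("cost_spike")                     # > 2× baseline
    for agent, err in metrics["agent_error_rates"].items():
        if err > 0.10:
            alerts.append(f"agent_errors:{agent}")      # per-agent > 10%
    if metrics["conversation_len"] > 2 * baseline["median_conversation_len"]:
        alerts.append("long_conversations")             # > 2× median
    return alerts  # each entry maps to a runbook

baseline = {"success_rate": 0.92, "cost_usd": 1.0,
            "median_conversation_len": 10}
now = {"success_rate": 0.85, "cost_usd": 2.5,
       "agent_error_rates": {"planner": 0.02, "coder": 0.15},
       "conversation_len": 12}
fired = check_alerts(now, baseline)
```

With these toy numbers, the success-rate drop, the cost spike, and the coder agent's error rate all fire; conversation length (12 vs. a 20 ceiling) does not.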
Building Your Eval Stack
From benchmarks to production monitoring
Roadmap
Start with end-to-end task tests from real logs. Add agent-level unit tests for each role. Build regression suites in CI. Run ablation studies before adding agents. Deploy with production monitoring and runbooks. Review metrics weekly. Next chapter: safety, control, and failure modes — what happens when evaluation misses something and the system goes wrong.
Pattern
Task tests → Agent unit tests
Regression CI → Ablation
Prod monitoring → Runbooks
// Ch 9: safety & failure modes
Key insight: Evaluation is not a phase — it’s a continuous practice that runs from development through production.