Ch 9 — Safety, Control & Failure Modes

Cascading failures, guardrails, sandboxing, kill switches, audit trails, and adversarial robustness
Threat → Guard → Contain → Monitor → Stop → Recover
Why Multi-Agent Safety Is Different
Cascading failures and distributed risk
The Stakes
A single LLM can hallucinate; a multi-agent system can hallucinate, act on the hallucination, and then have other agents build on that action before anyone notices. Failures cascade: Agent A writes bad code, Agent B deploys it, Agent C reports success. The attack surface is larger too — prompt injection in one agent can propagate through messages to others. Safety in MAS requires thinking about systemic risk, not just individual model alignment.
Pattern
Single agent: hallucinate
Multi-agent: hallucinate → act → cascade
// Systemic risk, not just model risk
Key insight: The danger is not one bad output — it is a chain of agents trusting each other’s mistakes.
Common Failure Modes
A taxonomy of what goes wrong
Catalog
Infinite loops: agents keep requesting clarification or retrying without progress. Echo chambers: agents reinforce each other’s errors instead of correcting them. Goal drift: the original objective is lost as agents optimize for intermediate metrics. Resource exhaustion: unbounded token spend or API calls. Privilege escalation: an agent tricks another into using a tool it shouldn’t. Data leakage: private context from one agent leaks into another’s output. Map each failure mode to a specific mitigation.
Pattern
Loop → max_turns
Echo → diversity + external check
Drift → goal anchoring
Exhaust → budgets
Escalation → IAM
Leak → context scoping
Key insight: Build a failure mode catalog for your system and map each entry to a concrete guard.
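The first entry in that catalog, the infinite clarification loop, has the simplest guard: a hard turn budget. A minimal sketch (the `run_conversation` helper and agent callables here are illustrative, not from any specific framework):

```python
# Loop guard sketch: every conversation carries a turn budget, and the run
# halts with a clear error instead of spinning forever on clarifications.

class TurnBudgetExceeded(Exception):
    """Raised when agents exchange more turns than the budget allows."""

def run_conversation(agents, task, max_turns=8):
    """Round-robin the task between agents until one returns None (done)."""
    message, turns = task, 0
    while True:
        for agent in agents:
            if turns >= max_turns:
                raise TurnBudgetExceeded(f"stopped after {turns} turns")
            reply = agent(message)   # each agent is a callable: str -> str | None
            turns += 1
            if reply is None:        # agent signals completion
                return message
            message = reply
```

The point is that the budget is enforced by the orchestrator, not by asking the agents themselves to stop: an agent stuck in a clarification loop will never volunteer to halt.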
Guardrails & Input/Output Filters
First line of defense
Layer
Input guardrails screen messages entering each agent for prompt injection, off-topic content, and policy violations. Output guardrails validate responses before they reach the next agent or external systems: schema checks, toxicity filters, PII redaction, and action allowlists. In multi-agent systems, guardrails must run at every agent boundary, not just the user-facing edge. Use fast, deterministic checks (regex, schema) before expensive LLM-based classifiers.
Pattern
Input: injection + policy filter
Output: schema + PII + allowlist
Every boundary, not just the edge
// Fast checks first, LLM checks second
Key insight: Guardrails only at the user-facing edge leave agent-to-agent messages completely unprotected.
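The "fast checks first" ordering can be sketched as a small guard function. This is a minimal illustration, assuming a message dict with `sender` and `content` keys and a hypothetical `llm_flags_message` classifier; the injection regexes are examples, not a complete filter:

```python
import re

# Layered guardrail sketch: cheap deterministic checks (schema, regex) run
# first; the expensive LLM classifier runs only on messages that pass.

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def fast_checks(message: dict) -> list[str]:
    """Deterministic screening: schema shape plus injection regexes."""
    problems = []
    if not {"sender", "content"} <= message.keys():
        problems.append("schema: missing sender/content")
    text = str(message.get("content", ""))
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            problems.append(f"injection: matched {pat.pattern!r}")
    return problems

def guard(message: dict, llm_flags_message=lambda m: []) -> list[str]:
    problems = fast_checks(message)
    if problems:                  # fail fast: skip the expensive LLM check
        return problems
    return llm_flags_message(message)
```

Because the same `guard` is pure function of the message, it can run at every agent boundary, not just the user-facing one.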
Containment & Sandboxing
Limiting blast radius
Architecture
Sandboxing limits what each agent can affect: file system access, network calls, database writes, and spending limits. Run code-executing agents in isolated containers with resource caps. Use budget envelopes: each task gets a token budget and dollar cap; the system halts if exceeded. Blast radius analysis: if this agent goes rogue, what is the worst it can do? Design so the answer is bounded and reversible.
Pattern
Container: isolated execution
Budget: token + dollar cap
Blast radius: worst case bounded
// Reversible > irreversible actions
Key insight: Ask “if this agent goes rogue, what’s the worst case?” — then make that case survivable.
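The budget envelope described above can be a tiny object that every agent call must pass through. A minimal sketch with illustrative names; real systems would also persist spend across restarts:

```python
# Budget envelope sketch: each task carries a token budget and a dollar cap,
# and any charge that would breach either halts the task before spending.

class BudgetExceeded(Exception):
    pass

class BudgetEnvelope:
    def __init__(self, max_tokens: int, max_dollars: float):
        self.max_tokens, self.max_dollars = max_tokens, max_dollars
        self.tokens_used, self.dollars_spent = 0, 0.0

    def charge(self, tokens: int, dollars: float) -> None:
        """Record spend; raise *before* the envelope is breached."""
        if (self.tokens_used + tokens > self.max_tokens
                or self.dollars_spent + dollars > self.max_dollars):
            raise BudgetExceeded(
                f"{self.tokens_used} + {tokens} tokens or "
                f"${self.dollars_spent + dollars:.2f} exceeds envelope")
        self.tokens_used += tokens
        self.dollars_spent += dollars
```

Note that `charge` checks before mutating, so a rejected call leaves the envelope in a consistent state, keeping the worst case bounded.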
Kill Switches & Circuit Breakers
Stopping the system safely
Mechanism
A kill switch halts all agent activity immediately — essential for runaway loops or detected attacks. A circuit breaker is automatic: if error rate exceeds a threshold, the system degrades gracefully (e.g., falls back to single-agent mode or queues tasks for human review). Implement at multiple levels: per-agent, per-task, and system-wide. Test kill switches regularly — an untested emergency stop is not a stop at all.
Pattern
Kill switch: manual, immediate
Circuit breaker: auto, threshold
Fallback: single-agent or human
// Test monthly
Key insight: An untested kill switch is a false sense of security — drill it regularly.
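The automatic half, the circuit breaker, can be sketched as a sliding error-rate window. This is illustrative only; a production breaker would also need a half-open state and timed reset probes:

```python
import collections

# Circuit breaker sketch: if the error rate over a sliding window of recent
# calls crosses a threshold, the breaker opens and every call degrades to a
# caller-supplied fallback (e.g. single-agent mode or a human-review queue).

class CircuitBreaker:
    def __init__(self, window=20, max_error_rate=0.5):
        self.results = collections.deque(maxlen=window)  # True = success
        self.max_error_rate = max_error_rate

    @property
    def open(self) -> bool:
        if not self.results:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_error_rate

    def call(self, fn, fallback):
        if self.open:
            return fallback()          # degrade gracefully, skip fn entirely
        try:
            result = fn()
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            return fallback()
```

The manual kill switch is deliberately not shown here: it should be a separate, dumber mechanism (a flag checked before every call) so that a bug in the breaker cannot disable the emergency stop.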
Oversight & Audit Trails
Human accountability over autonomous systems
Governance
Audit trails record every decision, tool call, and message with timestamps, agent IDs, and trace IDs. They enable post-incident analysis, compliance reporting, and blame attribution. Oversight means humans can review any decision after the fact and intervene in real time for high-risk actions. Store audit logs in append-only, tamper-evident storage. Retention policies should match your regulatory requirements.
Pattern
Log: decision + tool + message
Append-only storage
Retention: match regulation
// Tamper-evident for compliance
Key insight: If you cannot reconstruct why the system did what it did, you cannot govern it.
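One common way to make a log tamper-evident is hash chaining: each entry embeds the hash of the previous one, so an after-the-fact edit breaks the chain. A minimal in-memory sketch (real systems would also ship entries to external append-only storage):

```python
import hashlib, json, time

# Tamper-evident audit trail sketch: entries carry agent IDs and timestamps,
# and each entry's hash covers the previous entry's hash.

class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, agent_id: str, event: str, detail: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"ts": time.time(), "agent_id": agent_id,
                "event": event, "detail": detail, "prev": prev_hash}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if recomputed != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

In practice the trace ID mentioned above would go into each entry's `detail`, so a single `verify` plus a trace-ID filter reconstructs any decision path end to end.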
Adversarial Robustness
Prompt injection, data poisoning, and manipulation
Threats
Multi-agent systems face inter-agent prompt injection: a malicious input to Agent A crafts a message that hijacks Agent B. Data poisoning: corrupted memory or tool outputs that steer agents wrong. Social engineering: an external user manipulates the conversation to make agents bypass controls. Mitigations: input sanitization at every boundary, role-based message validation (reject unexpected performatives), anomaly detection on message patterns, and red teaming with adversarial scenarios.
Pattern
Inter-agent injection: sanitize boundaries
Poisoning: validate tool outputs
Social eng: anomaly detection
// Red team quarterly
Key insight: The most dangerous injection is the one that passes through Agent A to hijack Agent B.
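Role-based message validation, rejecting unexpected performatives, can be as simple as an allowlist per role. The roles and performatives below are illustrative, not a standard vocabulary:

```python
# Role-based message validation sketch: each role may send only an
# allowlisted set of performatives; anything else is rejected at the
# boundary before the receiving agent ever sees it.

ALLOWED_PERFORMATIVES = {
    "planner":  {"request", "inform"},
    "executor": {"inform", "failure"},
    "critic":   {"inform", "refuse"},
}

def validate_message(msg: dict) -> bool:
    """True only if the sender's role is allowed to use this performative."""
    role = msg.get("role")
    return msg.get("performative") in ALLOWED_PERFORMATIVES.get(role, set())
```

This is exactly the guard against the L80 scenario above: even if Agent A is hijacked into emitting a `request` it should never send, the boundary drops it before Agent B acts on it.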
Safety Checklist
From threat model to production hardening
Summary
Build a threat model specific to your agent topology. Implement guardrails at every boundary, sandbox execution, set budgets, wire kill switches, and log everything. Red team before launch and quarterly after. Next and final chapter: production patterns and the future of multi-agent systems — orchestration at scale, cost engineering, and where the field is heading.
Pattern
Threat model → Guardrails → Sandbox
Budgets → Kill switches → Audit
Red team → Monitor → Iterate
// Ch 10: production & future
Key insight: Safety is not a feature you add at the end — it is an architecture decision you make at the start.