Ch 9 — Safety, Control & Failure Modes

Cascading failures, guardrails, sandboxing, kill switches, audit trails, and adversarial robustness
Threat → Guard → Contain → Monitor → Stop → Recover
Why Multi-Agent Safety Is Different
Cascading failures and distributed risk
The Stakes
A single LLM can hallucinate; a multi-agent system can hallucinate, act on the hallucination, and then have other agents build on that action before anyone notices. Failures cascade: Agent A writes bad code, Agent B deploys it, Agent C reports success. The attack surface is larger too — prompt injection in one agent can propagate through messages to others. Safety in MAS requires thinking about systemic risk, not just individual model alignment.
Pattern
Single agent: hallucinate
Multi-agent: hallucinate → act → cascade
// Systemic risk, not just model risk
Key insight: The danger is not one bad output — it is a chain of agents trusting each other’s mistakes.
Common Failure Modes
A taxonomy of what goes wrong
Catalog
Infinite loops: agents keep requesting clarification or retrying without progress. Echo chambers: agents reinforce each other’s errors instead of correcting them. Goal drift: the original objective is lost as agents optimize for intermediate metrics. Resource exhaustion: unbounded token spend or API calls. Privilege escalation: an agent tricks another into using a tool it shouldn’t. Data leakage: private context from one agent leaks into another’s output. Map each failure mode to a specific mitigation.
Pattern
Loop → max_turns
Echo → diversity + external check
Drift → goal anchoring
Exhaust → budgets
Escalation → IAM
Leak → context scoping
Key insight: Build a failure mode catalog for your system and map each entry to a concrete guard.
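The first entry in that catalog, the infinite clarification loop, has the simplest guard: a hard turn budget. A minimal sketch (the `run_conversation` helper and agent callables here are illustrative, not from any specific framework):

```python
# Loop guard sketch: every conversation carries a turn budget, and the run
# halts with a clear error instead of spinning forever on clarifications.

class TurnBudgetExceeded(Exception):
    """Raised when agents exchange more turns than the budget allows."""

def run_conversation(agents, task, max_turns=8):
    """Round-robin the task between agents until one returns None (done)."""
    message, turns = task, 0
    while True:
        for agent in agents:
            if turns >= max_turns:
                raise TurnBudgetExceeded(f"stopped after {turns} turns")
            reply = agent(message)   # each agent is a callable: str -> str | None
            turns += 1
            if reply is None:        # agent signals completion
                return message
            message = reply
```

The point is that the budget is enforced by the orchestrator, not by asking the agents themselves to stop: an agent stuck in a clarification loop will never volunteer to halt.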
Guardrails & Input/Output Filters
First line of defense
Layer
Input guardrails screen messages entering each agent for prompt injection, off-topic content, and policy violations. Output guardrails validate responses before they reach the next agent or external systems: schema checks, toxicity filters, PII redaction, and action allowlists. In multi-agent systems, guardrails must run at every agent boundary, not just the user-facing edge. Use fast, deterministic checks (regex, schema) before expensive LLM-based classifiers.
Pattern
Input: injection + policy filter
Output: schema + PII + allowlist
Every boundary, not just the edge
// Fast checks first, LLM checks second
Key insight: Guardrails only at the user-facing edge leave agent-to-agent messages completely unprotected.
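The "fast checks first" ordering can be sketched as a small guard function. This is a minimal illustration, assuming a message dict with `sender` and `content` keys and a hypothetical `llm_flags_message` classifier; the injection regexes are examples, not a complete filter:

```python
import re

# Layered guardrail sketch: cheap deterministic checks (schema, regex) run
# first; the expensive LLM classifier runs only on messages that pass.

INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
    re.compile(r"reveal (your )?system prompt", re.I),
]

def fast_checks(message: dict) -> list[str]:
    """Deterministic screening: schema shape plus injection regexes."""
    problems = []
    if not {"sender", "content"} <= message.keys():
        problems.append("schema: missing sender/content")
    text = str(message.get("content", ""))
    for pat in INJECTION_PATTERNS:
        if pat.search(text):
            problems.append(f"injection: matched {pat.pattern!r}")
    return problems

def guard(message: dict, llm_flags_message=lambda m: []) -> list[str]:
    problems = fast_checks(message)
    if problems:                  # fail fast: skip the expensive LLM check
        return problems
    return llm_flags_message(message)
```

Because the same `guard` is pure function of the message, it can run at every agent boundary, not just the user-facing one.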
Containment & Sandboxing
Limiting blast radius
Architecture
Sandboxing limits what each agent can affect: file system access, network calls, database writes, and spending limits. Run code-executing agents in isolated containers with resource caps. Use budget envelopes: each task gets a token budget and dollar cap; the system halts if exceeded. Blast radius analysis: if this agent goes rogue, what is the worst it can do? Design so the answer is bounded and reversible.
Pattern
Container: isolated execution
Budget: token + dollar cap
Blast radius: worst case bounded
// Reversible > irreversible actions
Key insight: Ask “if this agent goes rogue, what’s the worst case?” — then make that case survivable.
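The budget envelope described above can be a tiny object that every agent call must pass through. A minimal sketch with illustrative names; real systems would also persist spend across restarts:

```python
# Budget envelope sketch: each task carries a token budget and a dollar cap,
# and any charge that would breach either halts the task before spending.

class BudgetExceeded(Exception):
    pass

class BudgetEnvelope:
    def __init__(self, max_tokens: int, max_dollars: float):
        self.max_tokens, self.max_dollars = max_tokens, max_dollars
        self.tokens_used, self.dollars_spent = 0, 0.0

    def charge(self, tokens: int, dollars: float) -> None:
        """Record spend; raise *before* the envelope is breached."""
        if (self.tokens_used + tokens > self.max_tokens
                or self.dollars_spent + dollars > self.max_dollars):
            raise BudgetExceeded(
                f"{self.tokens_used} + {tokens} tokens or "
                f"${self.dollars_spent + dollars:.2f} exceeds envelope")
        self.tokens_used += tokens
        self.dollars_spent += dollars
```

Note that `charge` checks before mutating, so a rejected call leaves the envelope in a consistent state, keeping the worst case bounded.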
Kill Switches & Circuit Breakers
Stopping the system safely
Mechanism
A kill switch halts all agent activity immediately — essential for runaway loops or detected attacks. A circuit breaker is automatic: if error rate exceeds a threshold, the system degrades gracefully (e.g., falls back to single-agent mode or queues tasks for human review). Implement at multiple levels: per-agent, per-task, and system-wide. Test kill switches regularly — an untested emergency stop is not a stop at all.
Pattern
Kill switch: manual, immediate
Circuit breaker: auto, threshold
Fallback: single-agent or human
// Test monthly
Key insight: An untested kill switch is a false sense of security — drill it regularly.
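The automatic half, the circuit breaker, can be sketched as a sliding error-rate window. This is illustrative only; a production breaker would also need a half-open state and timed reset probes:

```python
import collections

# Circuit breaker sketch: if the error rate over a sliding window of recent
# calls crosses a threshold, the breaker opens and every call degrades to a
# caller-supplied fallback (e.g. single-agent mode or a human-review queue).

class CircuitBreaker:
    def __init__(self, window=20, max_error_rate=0.5):
        self.results = collections.deque(maxlen=window)  # True = success
        self.max_error_rate = max_error_rate

    @property
    def open(self) -> bool:
        if not self.results:
            return False
        failures = self.results.count(False)
        return failures / len(self.results) > self.max_error_rate

    def call(self, fn, fallback):
        if self.open:
            return fallback()          # degrade gracefully, skip fn entirely
        try:
            result = fn()
            self.results.append(True)
            return result
        except Exception:
            self.results.append(False)
            return fallback()
```

The manual kill switch is deliberately not shown here: it should be a separate, dumber mechanism (a flag checked before every call) so that a bug in the breaker cannot disable the emergency stop.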
Oversight & Audit Trails
Human accountability over autonomous systems
Governance
Audit trails record every decision, tool call, and message with timestamps, agent IDs, and trace IDs. They enable post-incident analysis, compliance reporting, and blame attribution. Oversight means humans can review any decision after the fact and intervene in real time for high-risk actions. Store audit logs in append-only, tamper-evident storage. Retention policies should match your regulatory requirements.
Pattern
Log: decision + tool + message
Append-only storage
Retention: match regulation
// Tamper-evident for compliance
Key insight: If you cannot reconstruct why the system did what it did, you cannot govern it.
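One common way to make a log tamper-evident is hash chaining: each entry embeds the hash of the previous one, so an after-the-fact edit breaks the chain. A minimal in-memory sketch (real systems would also ship entries to external append-only storage):

```python
import hashlib, json, time

# Tamper-evident audit trail sketch: entries carry agent IDs and timestamps,
# and each entry's hash covers the previous entry's hash.

class AuditLog:
    def __init__(self):
        self.entries = []

    def record(self, agent_id: str, event: str, detail: dict) -> None:
        prev_hash = self.entries[-1]["hash"] if self.entries else "genesis"
        body = {"ts": time.time(), "agent_id": agent_id,
                "event": event, "detail": detail, "prev": prev_hash}
        digest = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        self.entries.append({**body, "hash": digest})

    def verify(self) -> bool:
        """Recompute the chain; False if any entry was altered."""
        prev = "genesis"
        for e in self.entries:
            body = {k: v for k, v in e.items() if k != "hash"}
            if body["prev"] != prev:
                return False
            recomputed = hashlib.sha256(
                json.dumps(body, sort_keys=True).encode()).hexdigest()
            if recomputed != e["hash"]:
                return False
            prev = e["hash"]
        return True
```

In practice the trace ID mentioned above would go into each entry's `detail`, so a single `verify` plus a trace-ID filter reconstructs any decision path end to end.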
Adversarial Robustness
Prompt injection, data poisoning, and manipulation
Threats
Multi-agent systems face inter-agent prompt injection: a malicious input to Agent A crafts a message that hijacks Agent B. Data poisoning: corrupted memory or tool outputs that steer agents wrong. Social engineering: an external user manipulates the conversation to make agents bypass controls. Mitigations: input sanitization at every boundary, role-based message validation (reject unexpected performatives), anomaly detection on message patterns, and red teaming with adversarial scenarios.
Pattern
Inter-agent injection: sanitize boundaries
Poisoning: validate tool outputs
Social eng: anomaly detection
// Red team quarterly
Key insight: The most dangerous injection is the one that passes through Agent A to hijack Agent B.
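Role-based message validation, rejecting unexpected performatives, can be as simple as an allowlist per role. The roles and performatives below are illustrative, not a standard vocabulary:

```python
# Role-based message validation sketch: each role may send only an
# allowlisted set of performatives; anything else is rejected at the
# boundary before the receiving agent ever sees it.

ALLOWED_PERFORMATIVES = {
    "planner":  {"request", "inform"},
    "executor": {"inform", "failure"},
    "critic":   {"inform", "refuse"},
}

def validate_message(msg: dict) -> bool:
    """True only if the sender's role is allowed to use this performative."""
    role = msg.get("role")
    return msg.get("performative") in ALLOWED_PERFORMATIVES.get(role, set())
```

This is exactly the guard against the L80 scenario above: even if Agent A is hijacked into emitting a `request` it should never send, the boundary drops it before Agent B acts on it.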
Safety Checklist
From threat model to production hardening
Summary
Build a threat model specific to your agent topology. Implement guardrails at every boundary, sandbox execution, set budgets, wire kill switches, and log everything. Red team before launch and quarterly after. Next and final chapter: production patterns and the future of multi-agent systems — orchestration at scale, cost engineering, and where the field is heading.
Pattern
Threat model → Guardrails → Sandbox
Budgets → Kill switches → Audit
Red team → Monitor → Iterate
// Ch 10: production & future
Key insight: Safety is not a feature you add at the end — it is an architecture decision you make at the start.