Ch 7 — LLM-Based Multi-Agent Frameworks

AutoGen-style patterns, role specialization, tools, conversation loops, HITL, and observability
Modern MAS: Roles → Scaffold → Tools → Chat → Human → Ship
The LLM Multi-Agent Landscape
From research prototypes to production stacks
Context
Since 2023, a wave of frameworks has emerged for orchestrating multiple LLM-powered agents: Microsoft’s AutoGen, CrewAI, LangGraph for stateful agent graphs, and many others. They share a core idea: wrap LLM calls in agent abstractions with roles, tools, and conversation protocols. The differences lie in how much structure they impose (free chat vs rigid DAG), human-in-the-loop support, and observability. This chapter surveys the design patterns rather than endorsing a single library.
Pattern
Agent = LLM + role + tools
Framework = orchestration glue
// Patterns outlast libraries
Key insight: Learn the patterns (role specialization, conversation loops, tool routing) — frameworks change fast.
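The "agent = LLM + role + tools" idea can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `fake_llm`, `Agent`, and all field names here are hypothetical stand-ins, with `fake_llm` taking the place of a real model call.

```python
# Minimal sketch of the core abstraction shared by AutoGen-style frameworks.
from dataclasses import dataclass, field
from typing import Callable

def fake_llm(system: str, user: str) -> str:
    # Placeholder for a real model API call.
    return f"[{system}] response to: {user}"

@dataclass
class Agent:
    name: str
    role: str                          # system prompt defining the role
    tools: dict = field(default_factory=dict)  # name -> Callable

    def ask(self, message: str) -> str:
        return fake_llm(self.role, message)

planner = Agent("planner", "You decompose tasks into steps.")
print(planner.ask("Build a CSV report"))
```

Everything a framework adds on top (routing, memory, turn-taking) is orchestration glue around this core.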
Role Specialization & System Prompts
Giving each agent a clear identity
Pattern
The simplest multi-agent pattern: assign each agent a distinct system prompt that defines its role, expertise, constraints, and output format. A “Planner” agent decomposes tasks; a “Coder” writes code; a “Reviewer” critiques. Role clarity reduces scope creep (one agent trying to do everything) and makes failures attributable. Keep roles narrow and testable — if you cannot write a unit test for a role’s expected behavior, the role is too vague.
Pattern
Planner: decompose only
Coder: implement + test
Reviewer: critique + approve
// One responsibility per agent
Key insight: If you cannot unit-test a role’s expected output, the role definition is too loose.
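"Unit-testable role" is concrete: the role prompt pins down an output contract you can assert on. A hedged sketch, where `run_reviewer` is a hypothetical stub standing in for a live LLM call with `REVIEWER_PROMPT`; the contract check works identically against a real model.

```python
# A role narrow enough to unit-test: the Reviewer must emit exactly
# 'APPROVE' or 'REJECT: <reason>'.
REVIEWER_PROMPT = (
    "You are a code reviewer. Respond with exactly one line: "
    "'APPROVE' or 'REJECT: <reason>'."
)

def run_reviewer(code: str) -> str:
    # Stub standing in for llm(REVIEWER_PROMPT, code).
    return "APPROVE" if "def " in code else "REJECT: no function defined"

def check_reviewer_contract(output: str) -> bool:
    # The unit test the role definition implies.
    return output == "APPROVE" or output.startswith("REJECT: ")

assert check_reviewer_contract(run_reviewer("def f(): pass"))
assert check_reviewer_contract(run_reviewer("x = 1"))
```

If you cannot write `check_reviewer_contract` for a role, the role prompt is too loose to deploy.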
Tool Use & Function Calling
Grounding agents in real actions
Capability
LLM agents become useful when they can call tools: search APIs, databases, code interpreters, file systems. Frameworks provide tool registries mapping function names to implementations. Key design choices: which agents get which tools (least privilege), approval gates for destructive tools (delete, pay, deploy), and retry/fallback when tools fail. Log every tool call with inputs, outputs, latency, and cost — this is your audit trail and debugging lifeline.
Pattern
register(tool, schema, agent)
gate: human_approve(destructive)
log: input, output, ms, cost
// Least privilege per agent
Key insight: Tool access is your permission model — treat it like IAM, not a free buffet.
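The three design choices above (least privilege, approval gates, logging) can live in one small registry. A sketch under assumptions: `ToolRegistry` and its methods are illustrative, not a real library's API, and the `approve` hook stands in for a human decision.

```python
import time

class ToolRegistry:
    """Tool routing with per-agent permissions, a gate for destructive
    tools, and a structured call log (audit trail)."""
    def __init__(self, approve=lambda tool, args: True):
        self._tools = {}          # name -> (fn, allowed_agents, destructive)
        self.log = []
        self._approve = approve   # human-approval hook

    def register(self, name, fn, agents, destructive=False):
        self._tools[name] = (fn, set(agents), destructive)

    def call(self, agent, name, **args):
        fn, allowed, destructive = self._tools[name]
        if agent not in allowed:                      # least privilege
            raise PermissionError(f"{agent} may not call {name}")
        if destructive and not self._approve(name, args):
            raise RuntimeError(f"{name} denied by approval gate")
        t0 = time.perf_counter()
        out = fn(**args)
        self.log.append({"agent": agent, "tool": name, "args": args,
                         "output": out,
                         "ms": (time.perf_counter() - t0) * 1e3})
        return out

reg = ToolRegistry(approve=lambda tool, args: False)  # deny all destructive
reg.register("search", lambda q: f"results for {q}", agents=["researcher"])
reg.register("delete", lambda path: None, agents=["ops"], destructive=True)
print(reg.call("researcher", "search", q="agents"))   # allowed and logged
```

Note the registry denies by default in two ways: an agent outside the allowlist raises `PermissionError`, and a destructive tool without approval never executes.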
Conversation Patterns
Round-robin, debate, reflection, and hierarchical chat
Patterns
Round-robin: agents take turns in a fixed order — simple, predictable, but can waste turns. Debate: two agents argue opposing positions; a judge picks the winner — good for reducing hallucination. Reflection: an agent critiques its own output before passing it on. Hierarchical: a manager agent delegates to specialists and synthesizes. Mix patterns: use debate for high-stakes decisions, round-robin for brainstorming, and hierarchical for execution. Always cap max turns.
Pattern
Round-robin: A → B → C → …
Debate: pro vs con → judge
Reflect: self-critique loop
Hierarchy: manager → workers
// Cap turns in every pattern
Key insight: Match the conversation pattern to the decision type — debate for judgment, hierarchy for execution.
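Round-robin with a hard turn cap is the simplest of these loops to write down. A hedged sketch with hypothetical stub agents (each a plain `str -> str` function) in place of LLM-backed roles:

```python
from itertools import cycle

def round_robin(agents, task, max_turns=6, done=lambda msg: False):
    """Fixed-order turn taking; always bounded by max_turns."""
    transcript, msg = [], task
    for _, agent in zip(range(max_turns), cycle(agents)):
        msg = agent(msg)              # each agent: str -> str
        transcript.append(msg)
        if done(msg):                 # early exit on a termination signal
            break
    return transcript

# Stub agents standing in for LLM-backed Planner/Coder/Reviewer roles.
plan = lambda m: f"plan({m})"
code = lambda m: f"code({m})"
review = lambda m: f"review({m}) DONE"

out = round_robin([plan, code, review], "task",
                  done=lambda m: m.endswith("DONE"))
print(out[-1])  # review(code(plan(task))) DONE
```

The same skeleton accommodates the other patterns by changing the routing: debate alternates two agents then calls a judge, hierarchy has the manager choose the next agent instead of `cycle`.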
Human-in-the-Loop
When to pause and ask
Design
Fully autonomous multi-agent systems are risky for high-stakes tasks. Human-in-the-loop (HITL) patterns: approval gates before tool execution, review checkpoints after planning, escalation when confidence is low or agents disagree. Design the UX so humans see a summary + diff, not a wall of agent chat. Track approval latency — if humans are the bottleneck, the system needs better defaults or tighter scoping, not more autonomy.
Pattern
gate: approve before execute
checkpoint: review plan
escalate: low confidence
// Show summary, not raw chat
Key insight: Human review should see structured summaries, not raw multi-agent transcripts.
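An approval gate that shows a summary plus diff, rather than the raw transcript, can be sketched like this. All names are illustrative, and the `approve` callback stands in for a real human reviewing the summary in a UI:

```python
def summarize_for_human(plan_steps, diff):
    """Condense agent output into a short reviewable summary."""
    lines = [f"Plan ({len(plan_steps)} steps):"]
    lines += [f"  {i + 1}. {s}" for i, s in enumerate(plan_steps)]
    lines.append(f"Diff: +{diff['added']} / -{diff['removed']} lines")
    return "\n".join(lines)

def gated_execute(plan_steps, diff, execute, approve):
    summary = summarize_for_human(plan_steps, diff)
    if not approve(summary):          # human decision point: sees summary only
        return "rejected"
    return execute()

result = gated_execute(
    ["write parser", "add tests"], {"added": 40, "removed": 3},
    execute=lambda: "deployed",
    approve=lambda summary: "tests" in summary,  # stand-in for a human
)
print(result)
```

Instrumenting `approve` with a timestamp gives you the approval-latency metric the slide recommends tracking.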
Memory & State Management
Short-term, long-term, and shared memory
Architecture
Agents need short-term memory (current conversation context), long-term memory (past interactions, learned facts), and shared state (team knowledge base). Frameworks offer vector stores for semantic retrieval, key-value stores for structured facts, and conversation buffers with summarization. Pitfalls: stale memory (outdated facts never evicted), context overflow (stuffing too much into the prompt), and privacy leaks (Agent A reading Agent B’s private memory).
Pattern
Short: conversation buffer
Long: vector store + KV
Shared: team knowledge base
// Evict stale, scope access
Key insight: Memory without eviction and access control becomes a liability, not an asset.
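Eviction and access scoping, the two pitfalls above, can both be shown in a tiny memory sketch. This is an illustration, not a framework API: a bounded deque plays the conversation buffer (a real system would add summarization and a vector store), and a `private` flag scopes long-term facts per agent.

```python
from collections import deque

class AgentMemory:
    """Short-term buffer with automatic eviction + scoped long-term facts."""
    def __init__(self, owner, buffer_size=4):
        self.owner = owner
        self.short = deque(maxlen=buffer_size)  # oldest entries evicted
        self.long = {}                           # key -> (value, private?)

    def remember(self, key, value, private=False):
        self.long[key] = (value, private)

    def recall(self, key, requester):
        value, private = self.long[key]
        if private and requester != self.owner:  # prevent cross-agent leaks
            raise PermissionError(f"{requester} cannot read {self.owner}'s memory")
        return value

mem = AgentMemory("agent_a", buffer_size=2)
for msg in ["m1", "m2", "m3"]:
    mem.short.append(msg)
print(list(mem.short))                       # ['m2', 'm3'] — m1 evicted
mem.remember("api_key", "secret", private=True)
mem.remember("team_goal", "ship v1")
print(mem.recall("team_goal", "agent_b"))    # shared fact: readable by the team
```

The `maxlen` eviction handles context overflow mechanically; stale *facts* in `long` still need an explicit TTL or refresh policy.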
Observability & Debugging
Seeing inside the black box
Practice
Multi-agent systems are hard to debug: failures cascade, blame is distributed, and logs are interleaved. Essential observability: trace IDs linking all messages in a task, agent-level metrics (tokens, latency, tool success rate), conversation replays with timestamps, and cost attribution per agent. Use structured logging (JSON) and build dashboards that show the critical path through the agent graph. Without this, production issues become archaeology.
Pattern
trace_id: links all messages
metrics: tokens, latency, cost
replay: conversation + timestamps
// Dashboard the critical path
Key insight: If you cannot replay a failed task from logs, your observability is not ready for production.
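Trace IDs, structured JSON logs, and per-agent cost attribution fit in one small sketch. `Tracer` and its method names are hypothetical; the point is the shape of the data, not a specific tracing library.

```python
import json
import uuid
from collections import defaultdict

class Tracer:
    """Structured JSON event log; all events in one task share a trace ID."""
    def __init__(self):
        self.events = []   # one JSON line per event (replayable in order)

    def start_task(self):
        return str(uuid.uuid4())

    def log(self, trace_id, agent, tokens, latency_ms, cost_usd):
        self.events.append(json.dumps({
            "trace_id": trace_id, "agent": agent, "tokens": tokens,
            "latency_ms": latency_ms, "cost_usd": cost_usd}))

    def cost_by_agent(self, trace_id):
        """Cost attribution: aggregate spend per agent for one task."""
        totals = defaultdict(float)
        for line in self.events:
            e = json.loads(line)
            if e["trace_id"] == trace_id:
                totals[e["agent"]] += e["cost_usd"]
        return dict(totals)

t = Tracer()
tid = t.start_task()
t.log(tid, "planner", tokens=300, latency_ms=800, cost_usd=0.002)
t.log(tid, "coder", tokens=1200, latency_ms=2100, cost_usd=0.009)
print(t.cost_by_agent(tid))
```

Because every event carries the `trace_id`, filtering the log by it reconstructs the full conversation for replay, which is the production-readiness bar the key insight sets.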
Choosing & Composing Frameworks
Practical selection criteria
Guide
Evaluate frameworks on: conversation control (can you enforce your protocol?), tool integration (registry, approval gates), human-in-the-loop UX, observability (traces, cost tracking), state management (memory, persistence), and community/maintenance. Start with the simplest pattern that works (often one planner + one executor). Add agents only when you have clear role boundaries and measurable improvement. Next chapter: how to evaluate all of this rigorously.
Pattern
Control + Tools + HITL
Observability + Memory
Start simple, add agents with evidence
// Ch 8: evaluation & benchmarks
Key insight: Add agents only when you have evidence they improve outcomes — more agents ≠ better system.
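The "simplest pattern that works" baseline, one planner plus one executor, needs no framework at all. A hedged sketch with hypothetical stub functions in place of LLM calls, useful as the control against which any added agent must show measurable improvement:

```python
# Baseline: one planner decomposes, one executor runs each step in order.
def planner(task):
    # Stub for an LLM planning call.
    return [f"step 1: analyze {task}", f"step 2: implement {task}"]

def executor(step):
    # Stub for an LLM execution call (or a tool invocation).
    return f"done: {step}"

def run(task):
    return [executor(s) for s in planner(task)]

print(run("csv export"))
```

Measure this baseline first; a third agent (e.g. a reviewer) earns its place only if it beats these numbers, which is exactly what the next chapter's evaluation methods are for.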