Ch 7 — LLM-Based Multi-Agent Frameworks

AutoGen-style patterns, role specialization, tools, conversation loops, HITL, and observability
Modern MAS: Roles → Scaffold → Tools → Chat → Human → Ship
The LLM Multi-Agent Landscape
From research prototypes to production stacks
Context
Since 2023, a wave of frameworks has emerged for orchestrating multiple LLM-powered agents: Microsoft’s AutoGen, CrewAI, LangGraph for stateful agent graphs, and many others. They share a core idea: wrap LLM calls in agent abstractions with roles, tools, and conversation protocols. The differences lie in how much structure they impose (free chat vs rigid DAG), human-in-the-loop support, and observability. This chapter surveys the design patterns rather than endorsing a single library.
Pattern
Agent = LLM + role + tools
Framework = orchestration glue
// Patterns outlast libraries
Key insight: Learn the patterns (role specialization, conversation loops, tool routing) — frameworks change fast.
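The "agent = LLM + role + tools" idea can be sketched in a few lines. This is a minimal illustration, not any particular framework's API: `fake_llm`, `Agent`, and all field names here are hypothetical stand-ins, with `fake_llm` taking the place of a real model call.

```python
# Minimal sketch of the core abstraction shared by AutoGen-style frameworks.
from dataclasses import dataclass, field
from typing import Callable

def fake_llm(system: str, user: str) -> str:
    # Placeholder for a real model API call.
    return f"[{system}] response to: {user}"

@dataclass
class Agent:
    name: str
    role: str                          # system prompt defining the role
    tools: dict = field(default_factory=dict)  # name -> Callable

    def ask(self, message: str) -> str:
        return fake_llm(self.role, message)

planner = Agent("planner", "You decompose tasks into steps.")
print(planner.ask("Build a CSV report"))
```

Everything a framework adds on top (routing, memory, turn-taking) is orchestration glue around this core.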
Role Specialization & System Prompts
Giving each agent a clear identity
Pattern
The simplest multi-agent pattern: assign each agent a distinct system prompt that defines its role, expertise, constraints, and output format. A “Planner” agent decomposes tasks; a “Coder” writes code; a “Reviewer” critiques. Role clarity reduces scope creep (one agent trying to do everything) and makes failures attributable. Keep roles narrow and testable — if you cannot write a unit test for a role’s expected behavior, the role is too vague.
Pattern
Planner: decompose only
Coder: implement + test
Reviewer: critique + approve
// One responsibility per agent
Key insight: If you cannot unit-test a role’s expected output, the role definition is too loose.
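"Unit-testable role" is concrete: the role prompt pins down an output contract you can assert on. A hedged sketch, where `run_reviewer` is a hypothetical stub standing in for a live LLM call with `REVIEWER_PROMPT`; the contract check works identically against a real model.

```python
# A role narrow enough to unit-test: the Reviewer must emit exactly
# 'APPROVE' or 'REJECT: <reason>'.
REVIEWER_PROMPT = (
    "You are a code reviewer. Respond with exactly one line: "
    "'APPROVE' or 'REJECT: <reason>'."
)

def run_reviewer(code: str) -> str:
    # Stub standing in for llm(REVIEWER_PROMPT, code).
    return "APPROVE" if "def " in code else "REJECT: no function defined"

def check_reviewer_contract(output: str) -> bool:
    # The unit test the role definition implies.
    return output == "APPROVE" or output.startswith("REJECT: ")

assert check_reviewer_contract(run_reviewer("def f(): pass"))
assert check_reviewer_contract(run_reviewer("x = 1"))
```

If you cannot write `check_reviewer_contract` for a role, the role prompt is too loose to deploy.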
Tool Use & Function Calling
Grounding agents in real actions
Capability
LLM agents become useful when they can call tools: search APIs, databases, code interpreters, file systems. Frameworks provide tool registries mapping function names to implementations. Key design choices: which agents get which tools (least privilege), approval gates for destructive tools (delete, pay, deploy), and retry/fallback when tools fail. Log every tool call with inputs, outputs, latency, and cost — this is your audit trail and debugging lifeline.
Pattern
register(tool, schema, agent)
gate: human_approve(destructive)
log: input, output, ms, cost
// Least privilege per agent
Key insight: Tool access is your permission model — treat it like IAM, not a free buffet.
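The three design choices above (least privilege, approval gates, logging) can live in one small registry. A sketch under assumptions: `ToolRegistry` and its methods are illustrative, not a real library's API, and the `approve` hook stands in for a human decision.

```python
import time

class ToolRegistry:
    """Tool routing with per-agent permissions, a gate for destructive
    tools, and a structured call log (audit trail)."""
    def __init__(self, approve=lambda tool, args: True):
        self._tools = {}          # name -> (fn, allowed_agents, destructive)
        self.log = []
        self._approve = approve   # human-approval hook

    def register(self, name, fn, agents, destructive=False):
        self._tools[name] = (fn, set(agents), destructive)

    def call(self, agent, name, **args):
        fn, allowed, destructive = self._tools[name]
        if agent not in allowed:                      # least privilege
            raise PermissionError(f"{agent} may not call {name}")
        if destructive and not self._approve(name, args):
            raise RuntimeError(f"{name} denied by approval gate")
        t0 = time.perf_counter()
        out = fn(**args)
        self.log.append({"agent": agent, "tool": name, "args": args,
                         "output": out,
                         "ms": (time.perf_counter() - t0) * 1e3})
        return out

reg = ToolRegistry(approve=lambda tool, args: False)  # deny all destructive
reg.register("search", lambda q: f"results for {q}", agents=["researcher"])
reg.register("delete", lambda path: None, agents=["ops"], destructive=True)
print(reg.call("researcher", "search", q="agents"))   # allowed and logged
```

Note the registry denies by default in two ways: an agent outside the allowlist raises `PermissionError`, and a destructive tool without approval never executes.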
Conversation Patterns
Round-robin, debate, reflection, and hierarchical chat
Patterns
Round-robin: agents take turns in a fixed order — simple, predictable, but can waste turns. Debate: two agents argue opposing positions; a judge picks the winner — good for reducing hallucination. Reflection: an agent critiques its own output before passing it on. Hierarchical: a manager agent delegates to specialists and synthesizes. Mix patterns: use debate for high-stakes decisions, round-robin for brainstorming, and hierarchical for execution. Always cap max turns.
Pattern
Round-robin: A → B → C → …
Debate: pro vs con → judge
Reflect: self-critique loop
Hierarchy: manager → workers
// Cap turns in every pattern
Key insight: Match the conversation pattern to the decision type — debate for judgment, hierarchy for execution.
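Round-robin with a hard turn cap is the simplest of these loops to write down. A hedged sketch with hypothetical stub agents (each a plain `str -> str` function) in place of LLM-backed roles:

```python
from itertools import cycle

def round_robin(agents, task, max_turns=6, done=lambda msg: False):
    """Fixed-order turn taking; always bounded by max_turns."""
    transcript, msg = [], task
    for _, agent in zip(range(max_turns), cycle(agents)):
        msg = agent(msg)              # each agent: str -> str
        transcript.append(msg)
        if done(msg):                 # early exit on a termination signal
            break
    return transcript

# Stub agents standing in for LLM-backed Planner/Coder/Reviewer roles.
plan = lambda m: f"plan({m})"
code = lambda m: f"code({m})"
review = lambda m: f"review({m}) DONE"

out = round_robin([plan, code, review], "task",
                  done=lambda m: m.endswith("DONE"))
print(out[-1])  # review(code(plan(task))) DONE
```

The same skeleton accommodates the other patterns by changing the routing: debate alternates two agents then calls a judge, hierarchy has the manager choose the next agent instead of `cycle`.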
Human-in-the-Loop
When to pause and ask
Design
Fully autonomous multi-agent systems are risky for high-stakes tasks. Human-in-the-loop (HITL) patterns: approval gates before tool execution, review checkpoints after planning, escalation when confidence is low or agents disagree. Design the UX so humans see a summary + diff, not a wall of agent chat. Track approval latency — if humans are the bottleneck, the system needs better defaults or tighter scoping, not more autonomy.
Pattern
gate: approve before execute
checkpoint: review plan
escalate: low confidence
// Show summary, not raw chat
Key insight: Human review should see structured summaries, not raw multi-agent transcripts.
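An approval gate that shows a summary plus diff, rather than the raw transcript, can be sketched like this. All names are illustrative, and the `approve` callback stands in for a real human reviewing the summary in a UI:

```python
def summarize_for_human(plan_steps, diff):
    """Condense agent output into a short reviewable summary."""
    lines = [f"Plan ({len(plan_steps)} steps):"]
    lines += [f"  {i + 1}. {s}" for i, s in enumerate(plan_steps)]
    lines.append(f"Diff: +{diff['added']} / -{diff['removed']} lines")
    return "\n".join(lines)

def gated_execute(plan_steps, diff, execute, approve):
    summary = summarize_for_human(plan_steps, diff)
    if not approve(summary):          # human decision point: sees summary only
        return "rejected"
    return execute()

result = gated_execute(
    ["write parser", "add tests"], {"added": 40, "removed": 3},
    execute=lambda: "deployed",
    approve=lambda summary: "tests" in summary,  # stand-in for a human
)
print(result)
```

Instrumenting `approve` with a timestamp gives you the approval-latency metric the slide recommends tracking.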
Memory & State Management
Short-term, long-term, and shared memory
Architecture
Agents need short-term memory (current conversation context), long-term memory (past interactions, learned facts), and shared state (team knowledge base). Frameworks offer vector stores for semantic retrieval, key-value stores for structured facts, and conversation buffers with summarization. Pitfalls: stale memory (outdated facts never evicted), context overflow (stuffing too much into the prompt), and privacy leaks (Agent A reading Agent B’s private memory).
Pattern
Short: conversation buffer
Long: vector store + KV
Shared: team knowledge base
// Evict stale, scope access
Key insight: Memory without eviction and access control becomes a liability, not an asset.
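Eviction and access scoping, the two pitfalls above, can both be shown in a tiny memory sketch. This is an illustration, not a framework API: a bounded deque plays the conversation buffer (a real system would add summarization and a vector store), and a `private` flag scopes long-term facts per agent.

```python
from collections import deque

class AgentMemory:
    """Short-term buffer with automatic eviction + scoped long-term facts."""
    def __init__(self, owner, buffer_size=4):
        self.owner = owner
        self.short = deque(maxlen=buffer_size)  # oldest entries evicted
        self.long = {}                           # key -> (value, private?)

    def remember(self, key, value, private=False):
        self.long[key] = (value, private)

    def recall(self, key, requester):
        value, private = self.long[key]
        if private and requester != self.owner:  # prevent cross-agent leaks
            raise PermissionError(f"{requester} cannot read {self.owner}'s memory")
        return value

mem = AgentMemory("agent_a", buffer_size=2)
for msg in ["m1", "m2", "m3"]:
    mem.short.append(msg)
print(list(mem.short))                       # ['m2', 'm3'] — m1 evicted
mem.remember("api_key", "secret", private=True)
mem.remember("team_goal", "ship v1")
print(mem.recall("team_goal", "agent_b"))    # shared fact: readable by the team
```

The `maxlen` eviction handles context overflow mechanically; stale *facts* in `long` still need an explicit TTL or refresh policy.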
Observability & Debugging
Seeing inside the black box
Practice
Multi-agent systems are hard to debug: failures cascade, blame is distributed, and logs are interleaved. Essential observability: trace IDs linking all messages in a task, agent-level metrics (tokens, latency, tool success rate), conversation replays with timestamps, and cost attribution per agent. Use structured logging (JSON) and build dashboards that show the critical path through the agent graph. Without this, production issues become archaeology.
Pattern
trace_id: links all messages
metrics: tokens, latency, cost
replay: conversation + timestamps
// Dashboard the critical path
Key insight: If you cannot replay a failed task from logs, your observability is not ready for production.
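Trace IDs, structured JSON logs, and per-agent cost attribution fit in one small sketch. `Tracer` and its method names are hypothetical; the point is the shape of the data, not a specific tracing library.

```python
import json
import uuid
from collections import defaultdict

class Tracer:
    """Structured JSON event log; all events in one task share a trace ID."""
    def __init__(self):
        self.events = []   # one JSON line per event (replayable in order)

    def start_task(self):
        return str(uuid.uuid4())

    def log(self, trace_id, agent, tokens, latency_ms, cost_usd):
        self.events.append(json.dumps({
            "trace_id": trace_id, "agent": agent, "tokens": tokens,
            "latency_ms": latency_ms, "cost_usd": cost_usd}))

    def cost_by_agent(self, trace_id):
        """Cost attribution: aggregate spend per agent for one task."""
        totals = defaultdict(float)
        for line in self.events:
            e = json.loads(line)
            if e["trace_id"] == trace_id:
                totals[e["agent"]] += e["cost_usd"]
        return dict(totals)

t = Tracer()
tid = t.start_task()
t.log(tid, "planner", tokens=300, latency_ms=800, cost_usd=0.002)
t.log(tid, "coder", tokens=1200, latency_ms=2100, cost_usd=0.009)
print(t.cost_by_agent(tid))
```

Because every event carries the `trace_id`, filtering the log by it reconstructs the full conversation for replay, which is the production-readiness bar the key insight sets.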
Choosing & Composing Frameworks
Practical selection criteria
Guide
Evaluate frameworks on: conversation control (can you enforce your protocol?), tool integration (registry, approval gates), human-in-the-loop UX, observability (traces, cost tracking), state management (memory, persistence), and community/maintenance. Start with the simplest pattern that works (often one planner + one executor). Add agents only when you have clear role boundaries and measurable improvement. Next chapter: how to evaluate all of this rigorously.
Pattern
Control + Tools + HITL
Observability + Memory
Start simple, add agents with evidence
// Ch 8: evaluation & benchmarks
Key insight: Add agents only when you have evidence they improve outcomes — more agents ≠ better system.
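The "simplest pattern that works" baseline, one planner plus one executor, needs no framework at all. A hedged sketch with hypothetical stub functions in place of LLM calls, useful as the control against which any added agent must show measurable improvement:

```python
# Baseline: one planner decomposes, one executor runs each step in order.
def planner(task):
    # Stub for an LLM planning call.
    return [f"step 1: analyze {task}", f"step 2: implement {task}"]

def executor(step):
    # Stub for an LLM execution call (or a tool invocation).
    return f"done: {step}"

def run(task):
    return [executor(s) for s in planner(task)]

print(run("csv export"))
```

Measure this baseline first; a third agent (e.g. a reviewer) earns its place only if it beats these numbers, which is exactly what the next chapter's evaluation methods are for.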