Ch 6 — Orchestration at Production Scale

Going from one agent to many — isolation, coordination, and monitoring
High Level
Queue → Worktree → Agent → Verify → Aggregate → Monitor
The Scaling Challenge
One agent is easy. Ten agents working together is an engineering problem.
What Changes at Scale
Running one background agent is straightforward — submit a task, get a PR. Running ten agents in parallel introduces new problems: file conflicts (two agents editing the same file), resource contention (agents competing for API rate limits), result aggregation (combining outputs from multiple agents), and failure handling (what happens when agent 3 of 10 fails?). Orchestration is the engineering discipline of solving these problems.
The Factory Floor Analogy
Think of it like a factory floor. One worker can use any workbench. Ten workers need assigned stations (isolation), a foreman (coordinator), a parts queue (task queue), quality inspection (verification), and a dashboard (monitoring). The individual worker’s skill hasn’t changed — what changed is the system around them.
Key insight: Orchestration is not about making individual agents better. It’s about building the system that lets many agents work together safely and efficiently.
Git Worktree Isolation
The standard primitive for parallel agent safety
The Problem It Solves
Multiple agents working in a shared directory create severe failures: file collisions (agents overwrite each other’s edits), context contamination (agents read partially modified files), git index corruption (concurrent staging operations), and conversation confusion (agents lose track of the actual codebase state). Full repository clones waste disk space and complicate rebasing.
How Worktrees Work
Git worktrees give each agent its own checked-out working directory while sharing a single .git object store. Each worktree has its own HEAD, index, and working tree files. Creation is fast and lightweight. This is now the standard isolation primitive — Claude Code, Codex, and Cursor all support --worktree flags for automatic creation, management, and cleanup.
Worktree commands
# Create a worktree for an agent task
git worktree add -b feature-auth ../project-auth main

# List active worktrees
git worktree list

# Clean up when the agent is done
git worktree remove ../project-auth
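In an orchestrator, those commands are typically wrapped in a small helper that creates one worktree per task and tears it down afterwards. A minimal sketch, assuming a one-worktree-per-task naming scheme (the `agent/` branch prefix and sibling-directory layout are illustrative choices, not a standard):

```python
import subprocess
from pathlib import Path

def create_agent_worktree(repo: Path, task_id: str, base: str = "main") -> Path:
    """Create an isolated worktree (and branch) for one agent task."""
    worktree = repo.parent / f"{repo.name}-{task_id}"
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add",
         "-b", f"agent/{task_id}", str(worktree), base],
        check=True,
    )
    return worktree

def remove_agent_worktree(repo: Path, worktree: Path) -> None:
    """Clean up the directory when the agent is done (the branch survives for the PR)."""
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "remove", str(worktree)],
        check=True,
    )
```

Because every worktree shares one `.git` object store, creating ten of these is far cheaper than ten clones.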
Task Queues & Coordination
Getting the right tasks to the right agents
The Task Queue Pattern
A task queue sits between task sources (tickets, Slack messages, CI triggers) and agents. Tasks enter the queue with metadata: priority, estimated complexity, required capabilities, and file scope. The orchestrator assigns tasks to available agents, ensuring no two agents work on overlapping file sets. When an agent finishes, it picks up the next task from the queue.
Conflict Prevention
The orchestrator’s most important job: preventing file-level conflicts. Before assigning a task, it checks the file scope against all currently running agents. If agent A is modifying auth/login.ts, no other agent gets a task that touches that file until agent A finishes. This is a simple lock mechanism, but it prevents the most common source of parallel agent failures.
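The queue-plus-lock mechanic can be sketched in a few lines. This is a minimal illustration, not a production scheduler; the `Task` fields mirror the metadata described above, and "lower number means higher priority" is an assumed convention:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class Task:
    task_id: str
    priority: int                 # lower number = higher priority (assumed convention)
    file_scope: frozenset[str]    # files the task is expected to touch

@dataclass
class Orchestrator:
    queue: list[Task] = field(default_factory=list)
    locked_files: set[str] = field(default_factory=set)

    def submit(self, task: Task) -> None:
        self.queue.append(task)
        self.queue.sort(key=lambda t: t.priority)

    def next_task(self) -> Task | None:
        """Assign the highest-priority task whose file scope is free."""
        for i, task in enumerate(self.queue):
            if not task.file_scope & self.locked_files:
                self.locked_files |= task.file_scope  # lock the files for this agent
                return self.queue.pop(i)
        return None  # every pending task overlaps a running one

    def finish(self, task: Task) -> None:
        self.locked_files -= task.file_scope  # release the lock
```

Note that `next_task` skips over a blocked high-priority task rather than stalling the whole queue; whether to skip or wait is a real design choice in practice.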
Key insight: The task queue is what turns a collection of independent agents into a coordinated system. Without it, you’re just running agents and hoping they don’t collide.
The Blueprint Pattern Revisited
Stripe’s approach to production-scale orchestration
Blueprints at Scale
In Chapter 2 we introduced blueprints as workflows combining deterministic and agent steps. At production scale, blueprints become the unit of orchestration. Each task type has a blueprint: “bug fix” blueprint, “dependency update” blueprint, “new endpoint” blueprint. The orchestrator matches incoming tasks to blueprints, and the blueprint defines the exact sequence of steps, quality gates, and retry policies.
Bounded Retry Rounds
Stripe’s Minions use bounded retry rounds with feedback loops. If the agent’s first attempt fails CI, it gets the error log and tries again — but only for a fixed number of rounds. If it can’t pass after 3 attempts, the task is escalated to a human. This prevents doom loops (agents retrying forever) while giving the agent a fair chance to self-correct.
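The bounded-retry loop is simple to state as code. A sketch of the pattern (not Stripe's implementation); `run_agent` and `run_ci` are hypothetical callables, and the CI result is assumed to expose `passed` and `error_log`:

```python
def run_with_bounded_retries(task, run_agent, run_ci, max_rounds=3):
    """Give the agent a fixed number of attempts, feeding CI errors back in."""
    feedback = None
    for round_num in range(1, max_rounds + 1):
        result = run_agent(task, feedback=feedback)
        ci = run_ci(result)
        if ci.passed:
            return {"status": "success", "rounds": round_num, "result": result}
        feedback = ci.error_log  # the agent sees the failure on its next attempt
    # Retry budget exhausted: hand off to a human instead of looping forever
    return {"status": "escalate_to_human", "rounds": max_rounds, "last_feedback": feedback}
```

The fixed `max_rounds` is what turns a potential doom loop into a bounded, escalatable failure.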
Key insight: Blueprints are what make agent output predictable. Without them, each agent invents its own workflow. With them, every bug fix follows the same steps, every dependency update goes through the same gates. Predictability enables trust at scale.
Result Aggregation
Combining outputs from multiple agents into a coherent whole
The Aggregation Problem
When 10 agents each produce a PR, someone needs to ensure they all work together. Agent A’s changes might conflict with Agent B’s changes at the semantic level even if there are no git merge conflicts. The aggregation layer merges results, runs integration tests, and identifies cross-PR conflicts before presenting the batch for human review.
Merge Ordering
The order in which PRs are merged matters. If PR 3 depends on changes in PR 1, merging them out of order breaks the build. The aggregation layer builds a dependency graph of PRs and suggests a merge order. For independent PRs, it can merge in parallel. For dependent PRs, it enforces sequential merging with CI verification between each step.
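Given a dependency graph of PRs, the merge order is a topological sort, and the "merge independent PRs in parallel" rule falls out of grouping the ready set into waves. A sketch using Python's standard-library `graphlib` (the PR names are illustrative):

```python
from graphlib import TopologicalSorter

def merge_order(deps: dict) -> list:
    """Flat merge order: every PR comes after the PRs it depends on."""
    return list(TopologicalSorter(deps).static_order())

def merge_batches(deps: dict) -> list:
    """Group PRs into waves; PRs within a wave are independent and can merge in parallel."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    batches = []
    while ts.is_active():
        ready = list(ts.get_ready())   # all PRs whose dependencies are merged
        batches.append(ready)
        ts.done(*ready)                # mark the wave merged, unlocking the next
    return batches
```

In a real system each wave would be followed by a CI run before `done()` is called, matching the "CI verification between each step" rule above.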
Key insight: Result aggregation is the most underappreciated part of multi-agent orchestration. Getting 10 agents to each produce good PRs is the easy part. Getting those 10 PRs to work together is the hard part.
Monitoring an Agent Fleet
Dashboards, alerts, and the metrics that matter
Fleet Metrics
Task throughput — tasks completed per hour/day.
Success rate — percentage of tasks that produce mergeable PRs.
Time to PR — from task submission to PR ready for review.
Review acceptance rate — percentage of agent PRs merged without significant changes.
Cost per task — API tokens + compute per completed task.
Doom loop rate — percentage of tasks that hit the retry limit.
Alert Conditions
Set alerts for: success rate drops below threshold (something changed in the codebase or the model), cost per task spikes (agents are using more tokens than expected — possible doom loops), queue depth growing (agents aren’t keeping up with incoming tasks), and review backlog growing (agents are producing PRs faster than humans can review them).
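Both the metrics and the alert conditions reduce to simple aggregation over per-task records. A minimal sketch; the `TaskRecord` fields and the threshold defaults are illustrative placeholders, not recommended values (except the 70% acceptance target, which comes from the key insight below):

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    succeeded: bool          # produced a mergeable PR
    merged_unchanged: bool   # merged without significant rework
    hit_retry_limit: bool    # doom-looped into escalation
    cost_usd: float          # API tokens + compute for this task

def fleet_metrics(records: list) -> dict:
    n = len(records)
    return {
        "success_rate": sum(r.succeeded for r in records) / n,
        "review_acceptance_rate": sum(r.merged_unchanged for r in records) / n,
        "doom_loop_rate": sum(r.hit_retry_limit for r in records) / n,
        "cost_per_task": sum(r.cost_usd for r in records) / n,
    }

def alerts(metrics: dict, min_acceptance: float = 0.70, max_cost: float = 5.00) -> list:
    """Threshold checks; max_cost is a placeholder, min_acceptance is the 70% target."""
    fired = []
    if metrics["review_acceptance_rate"] < min_acceptance:
        fired.append("review acceptance below target")
    if metrics["cost_per_task"] > max_cost:
        fired.append("cost per task spiking: check for doom loops")
    return fired
```

Throughput and time-to-PR would come from timestamps on the same records; they are omitted here to keep the sketch short.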
Key insight: The review acceptance rate is the single most important metric. If agents produce PRs that consistently need major rework, the system isn’t saving time — it’s creating work. Target 70%+ acceptance rate before scaling up.
Safety at Scale
Preventing runaway agents and protecting production
Guardrails
Token budgets — each task gets a maximum token spend. If the agent exceeds it, the task is killed and escalated.
Time limits — no task runs longer than N minutes.
File scope locks — agents can only modify files in their assigned scope.
Branch protection — agents can never push to main/production branches directly.
Secrets isolation — agents never have access to production credentials.
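The first two guardrails, token budget and time limit, can be enforced by a small guard object that the agent loop consults after every model call. A minimal sketch under that assumption (the class and its interface are illustrative):

```python
import time

class BudgetExceeded(Exception):
    """Raised when a task blows its token or time budget; the orchestrator escalates."""

class TaskGuard:
    def __init__(self, max_tokens: int, max_seconds: float):
        self.max_tokens = max_tokens
        self.deadline = time.monotonic() + max_seconds
        self.tokens_used = 0

    def charge(self, tokens: int) -> None:
        """Call after every model call; kills the task if either limit is exceeded."""
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token budget exhausted: {self.tokens_used}/{self.max_tokens}"
            )
        if time.monotonic() > self.deadline:
            raise BudgetExceeded("time limit exceeded")
```

File scope locks belong in the orchestrator's assignment logic, and branch protection and secrets isolation are enforced at the platform level (repo settings, credential scoping) rather than in agent code.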
The Kill Switch
Every orchestration system needs a global kill switch that stops all running agents immediately. This is your emergency brake for when something goes wrong at scale — a model update that degrades quality, a codebase change that confuses all agents, or a security incident. The kill switch should be accessible to any team lead, not just the platform team.
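One simple implementation is a shared flag that every agent polls between steps, so that anyone with access can halt the fleet without touching the platform. A sketch only; the flag-file location is a hypothetical choice, and a real deployment would more likely use a shared datastore or feature flag:

```python
from pathlib import Path

# Hypothetical shared location; any store all agents can read works here.
KILL_SWITCH = Path("/tmp/agent-fleet.kill")

def engage_kill_switch(reason: str, switch: Path = KILL_SWITCH) -> None:
    """Anyone with write access stops the fleet; the reason aids the postmortem."""
    switch.write_text(reason)

def fleet_halted(switch: Path = KILL_SWITCH) -> bool:
    """Agents call this between steps and exit cleanly if it returns True."""
    return switch.exists()

def clear_kill_switch(switch: Path = KILL_SWITCH) -> None:
    switch.unlink(missing_ok=True)
```

The key property is that engaging the switch requires no redeploy and no platform-team involvement, matching the accessibility requirement above.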
Key insight: Safety guardrails are not optional at scale. One runaway agent is a nuisance. Ten runaway agents burning through your API budget while producing broken PRs is an incident. Build the guardrails before you scale.
The Realistic Ramp-Up
From 1 agent to a fleet — the practical timeline
Month 1–2: Single Agent
One agent, one task type, manual task submission. Learn the patterns, measure success rate, build review workflows. No orchestration needed — just a developer submitting tasks and reviewing PRs.
Month 3–4: Small Fleet
2–5 agents, multiple task types, basic task queue. Add worktree isolation and file scope locks. Start tracking fleet metrics. This is where you discover the coordination challenges.
Month 5+: Production Fleet
5–20 agents, blueprints for common task types, automated task routing, full monitoring dashboard. Result aggregation for multi-agent migrations. At this point, the orchestration system is itself a product that needs maintenance and improvement.
Key insight: The ramp-up timeline is intentionally slow. Each phase builds the infrastructure and trust needed for the next. Teams that try to jump to 20 agents on day one invariably scale back to 2 after the first week of chaos.