Ch 7 — Orchestration at Scale

Production systems running hundreds of agents: Symphony, Minions, and the maturity model
High-level pipeline: Intake → Dispatch → Agents → Review → Merge → Monitor
OpenAI Symphony
Elixir-based orchestration, open-sourced March 2026
What It Is
OpenAI Symphony is an open-source orchestration framework released in March 2026, built in Elixir for its concurrency strengths. It polls project management tools like Linear for new issues, decomposes them into sub-tasks, and dispatches Codex agents to work on each sub-task in parallel. Results are assembled, reviewed, and merged automatically.
Architecture
// Symphony architecture
Intake:     Polls Linear for new issues
Planner:    Decomposes into sub-tasks
Dispatcher: Assigns to Codex agents
Workers:    Parallel agent execution
Assembler:  Combines results
Reviewer:   Automated quality check
Merger:     Creates and merges PRs
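The stages above can be sketched as a small pipeline. This is an illustrative Python sketch of the planner/dispatcher/worker/reviewer flow, not Symphony's actual Elixir API; `decompose`, `run_agent`, and `review` are hypothetical stubs standing in for real components.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(issue):
    # Planner: split an issue into independent sub-tasks (stub).
    return [f"{issue}::subtask-{i}" for i in range(3)]

def run_agent(subtask):
    # Worker: a coding agent would act here; we fake a result.
    return {"task": subtask, "diff": f"patch for {subtask}"}

def review(result):
    # Reviewer: automated quality gate (stub always passes).
    return True

def pipeline(issue):
    subtasks = decompose(issue)                       # Planner
    with ThreadPoolExecutor() as pool:                # Dispatcher + Workers
        results = list(pool.map(run_agent, subtasks))
    approved = [r for r in results if review(r)]      # Reviewer
    return approved                                   # Assembler/Merger would open PRs

print(len(pipeline("LIN-123")))  # → 3
```

The key structural point survives even in a stub: the human-facing input is the issue; everything downstream is mechanical.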
Key insight: Symphony represents the shift from “developer uses AI tool” to “system dispatches AI agents.” The human writes the issue; the system handles everything else. The harness is the entire pipeline, not just a constraint file.
Stripe Minions
1,000+ merged PRs per week from internal agents
The System
Stripe’s Minions is an internal agent system that generates and merges over 1,000 PRs per week. Each “minion” is a specialized agent with its own harness: constraint documents, tool access, review pipeline, and merge criteria. The system handles everything from bug fixes to dependency updates to documentation improvements.
Scale Challenges
At 1,000+ PRs/week, traditional human review is impossible. Stripe uses tiered review: automated checks handle routine PRs (dependency updates, formatting), agent review handles medium-complexity PRs, and human review is reserved for high-risk changes (payment logic, security). The harness determines which tier each PR enters.
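A tiered router like this can be expressed as a simple decision function. The categories and risk paths below are illustrative, not Stripe's actual rules.

```python
def review_tier(pr):
    """Route a PR to a review tier based on risk signals.
    Paths and PR kinds are hypothetical examples."""
    high_risk_paths = ("payments/", "security/", "auth/")
    if any(f.startswith(high_risk_paths) for f in pr["files"]):
        return "human"            # high-risk changes: human review
    if pr["kind"] in ("dependency-update", "formatting"):
        return "automated"        # routine changes: automated checks only
    return "agent"                # medium complexity: agent review

print(review_tier({"files": ["payments/charge.py"], "kind": "bugfix"}))  # → human
```

Note the ordering: risk classification runs before routine classification, so a "formatting" change that touches payment code still gets a human.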
Why it matters: Stripe demonstrates that agent-generated code at scale is not theoretical — it’s production reality. The key enabler is not the model; it’s the orchestration and review harness that makes 1,000+ weekly PRs manageable.
Basis: Zero Human Code
45-person startup, $200M revenue, entirely agent-generated code
The Story
Basis is a 45-person startup generating $200M in annual revenue with a codebase that contains zero human-written code. Every line was generated by AI agents operating within a comprehensive harness. Engineers at Basis don’t write code — they design and maintain the harness that guides agents to write correct code.
What Makes It Work
Basis’s success depends on an extremely mature harness: comprehensive constraint documents, extensive structural tests, multi-layer review pipelines, episodic memory from thousands of past tasks, and continuous entropy management. The harness is the product — the code is just its output.
Key insight: Basis proves that “model is commodity, harness is moat” is not just a slogan. Their competitive advantage is entirely in the harness. Any competitor could use the same models, but they’d need to build an equivalent harness to match Basis’s output quality.
The Four Maturity Levels
From ad-hoc to autonomous
Maturity Model
Level 1: Ad-Hoc
- Developer uses AI as autocomplete
- No constraint documents
- No review pipeline
- Manual everything

Level 2: Guided
- CLAUDE.md / AGENTS.md in place
- Basic linting rules
- Human reviews all agent output
- Single-agent tasks

Level 3: Systematic
- 3-tier constraint architecture
- Multi-agent review pipelines
- Episodic memory and learning
- CI/CD integration

Level 4: Autonomous
- Full orchestration (Symphony-like)
- Self-improving harness
- Entropy management
- Human oversight, not review
Where Most Teams Are
Most teams in early 2026 are at Level 1 or 2. They use AI tools but haven’t invested in systematic harness engineering. The jump from Level 2 to Level 3 is the highest-impact investment — it’s where agent reliability goes from “sometimes useful” to “consistently productive.”
Key insight: You don’t need to reach Level 4 to get value. Level 3 (systematic constraints, review pipelines, memory) is where most of the ROI lives. Level 4 is for organizations where agents are the primary code producers.
OpenAI’s 1M-Line Experiment
5 months, 3 engineers, zero human-written code
The Experiment
OpenAI built an internal codebase of over 1 million lines over 5 months with just 3 engineers and zero human-written code. The engineers designed and maintained the harness; agents wrote all the code. This was both a proof of concept and a stress test of harness engineering at extreme scale.
Lessons Learned
Constraint documents evolve rapidly at this velocity — weekly updates were necessary.

Entropy management is critical — without cleanup agents, the codebase degraded within weeks.

Human oversight shifts from reviewing code to reviewing the harness itself.

The bottleneck is harness quality, not model capability.
Critical insight: At 1M lines with zero human code, every bug is a harness failure. The experiment proved that harness engineering is not just about improving agent output — it’s about making agent output trustworthy enough to deploy without human code review.
Multi-Agent Coordination
When agents need to work together
Coordination Patterns
Sequential: Agent A completes, passes output to Agent B. Simple but slow.

Parallel: Multiple agents work on independent sub-tasks simultaneously. Fast but requires task decomposition.

Hierarchical: A lead agent plans and delegates to worker agents. Good for complex tasks.

Collaborative: Agents share a workspace and coordinate through shared state. Most complex but most flexible.
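The first three patterns can be sketched in a few lines of asyncio. This is a schematic, assuming a `worker` coroutine stands in for a real agent call; the names are hypothetical.

```python
import asyncio

async def worker(name, subtask):
    await asyncio.sleep(0)                 # stand-in for real agent work
    return f"{name} finished {subtask}"

async def sequential(subtasks):
    # Agent A completes, passes control to agent B. Simple but slow.
    results = []
    for i, t in enumerate(subtasks):
        results.append(await worker(f"agent-{i}", t))
    return results

async def parallel(subtasks):
    # Independent sub-tasks run simultaneously.
    return await asyncio.gather(
        *(worker(f"agent-{i}", t) for i, t in enumerate(subtasks)))

async def hierarchical(task):
    # A lead agent plans (here: trivially splits), then delegates.
    subtasks = [f"{task}/part-{i}" for i in range(2)]
    return await parallel(subtasks)

print(asyncio.run(hierarchical("refactor")))
```

The collaborative pattern is deliberately omitted: shared mutable state between agents needs the locking and conflict machinery discussed next, not a ten-line sketch.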
Conflict Resolution
When parallel agents modify the same files, conflicts arise. Resolution strategies: file-level locking (only one agent per file), merge-based (agents work on branches, merge conflicts resolved automatically or by a coordinator agent), or task decomposition (ensure sub-tasks don’t overlap at the file level).
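File-level locking, the simplest of the three strategies, can be sketched as follows. This is illustrative; a production system would use leases with timeouts so a crashed agent cannot hold a file forever.

```python
import threading

class FileLocks:
    """File-level locking: at most one agent may edit a file at a time."""
    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}                  # file path -> agent id

    def acquire(self, agent, paths):
        # All-or-nothing: grant the set of files only if none are held.
        with self._guard:
            if any(p in self._owners for p in paths):
                return False               # conflict: another agent holds a file
            for p in paths:
                self._owners[p] = agent
            return True

    def release(self, agent):
        with self._guard:
            self._owners = {p: a for p, a in self._owners.items() if a != agent}

locks = FileLocks()
print(locks.acquire("agent-1", ["api.py"]))  # → True
print(locks.acquire("agent-2", ["api.py"]))  # → False, agent-1 holds it
```

The all-or-nothing acquire also prevents deadlock between two agents each holding half of the other's file set.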
Rule of thumb: Start with sequential orchestration. Move to parallel only when you have reliable task decomposition. Collaborative multi-agent is the most powerful but requires the most sophisticated harness.
Monitoring Agent Fleets
Observability for systems running dozens of agents
What to Monitor
Task completion rate: What percentage of dispatched tasks complete successfully?

Time to completion: How long do tasks take? Are some stuck?

Cost per task: Token consumption and API costs per completed task.

Error rate by type: Which failure modes are most common?

Review pass rate: What percentage pass automated review on first attempt?
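The metrics above reduce to simple aggregates over per-task records. The record schema here is an assumption for illustration, not a standard.

```python
from statistics import mean

def fleet_metrics(tasks):
    """Compute fleet-level metrics from per-task records (hypothetical schema)."""
    done = [t for t in tasks if t["status"] == "done"]
    return {
        "completion_rate": len(done) / len(tasks),
        "avg_seconds": mean(t["seconds"] for t in done),
        "avg_cost_usd": mean(t["cost_usd"] for t in done),
        # mean of booleans gives the first-pass review rate
        "first_pass_review_rate": mean(t["review_passed_first"] for t in done),
    }

tasks = [
    {"status": "done", "seconds": 120, "cost_usd": 0.40, "review_passed_first": True},
    {"status": "done", "seconds": 300, "cost_usd": 1.10, "review_passed_first": False},
    {"status": "failed", "seconds": 600, "cost_usd": 2.00, "review_passed_first": False},
]
print(round(fleet_metrics(tasks)["completion_rate"], 2))  # → 0.67
```

Error rate by type is the same idea with a `groupby` on a failure-mode field; it is omitted here to keep the sketch short.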
Alerting
Stuck agents: Alert when an agent hasn’t produced output in N minutes.

Cost spikes: Alert when a single task exceeds the cost budget (likely a doom loop).

Error rate spikes: Alert when the failure rate exceeds the baseline (likely a harness or model issue).

Merge conflicts: Alert when parallel agents create conflicting changes.
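Three of the four alert conditions fit a single predicate function. The thresholds below are illustrative defaults, not recommendations; merge-conflict alerting needs repository state and is left out.

```python
def alerts(agent, budget_usd=5.0, stall_minutes=15, baseline_error_rate=0.05):
    """Evaluate alert conditions for one agent/task snapshot (hypothetical schema)."""
    fired = []
    if agent["minutes_since_output"] > stall_minutes:
        fired.append("stuck-agent")
    if agent["cost_usd"] > budget_usd:
        fired.append("cost-spike")        # likely a doom loop
    if agent["error_rate"] > 2 * baseline_error_rate:
        fired.append("error-rate-spike")  # likely a harness or model issue
    return fired

print(alerts({"minutes_since_output": 30, "cost_usd": 7.2, "error_rate": 0.02}))
# → ['stuck-agent', 'cost-spike']
```

In practice each condition would page a different owner: a stuck agent is an operations issue, a cost spike is a harness issue, an error-rate spike may be an upstream model issue.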
Key insight: At scale, you’re not monitoring individual agents — you’re monitoring the fleet. The same observability principles that apply to microservices apply to agent fleets: health checks, metrics, alerting, and dashboards.
Building Toward Scale
Practical steps from single-agent to fleet
The Path
Phase 1: Single agent, single task. Master the harness fundamentals: constraints, linting, review.

Phase 2: Single agent, multiple tasks. Add episodic memory and task-specific skills.

Phase 3: Multiple agents, parallel tasks. Add orchestration, conflict resolution, fleet monitoring.

Phase 4: Autonomous pipeline. Add self-improving harness, entropy management, human oversight (not review).
When to Scale
Scale when your single-agent harness is reliable. If your single agent has a >70% first-pass approval rate and your review pipeline catches the rest, you’re ready to add more agents. If your single agent still needs heavy human correction, adding more agents just multiplies the problems.
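The >70% first-pass heuristic from the text is trivial to encode, which makes it easy to wire into a dashboard or a CI gate. A minimal sketch:

```python
def ready_to_scale(approved_first_pass, total_tasks, threshold=0.70):
    """The >70% first-pass approval heuristic described above."""
    return total_tasks > 0 and approved_first_pass / total_tasks > threshold

print(ready_to_scale(36, 50))  # 72% first-pass → True
print(ready_to_scale(30, 50))  # 60% first-pass → False
```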
Key insight: Orchestration at scale is not about running more agents. It’s about having a harness reliable enough that running more agents produces more value, not more problems. Master the harness first, then scale.