Ch 7 — Orchestration at Scale

Production systems running hundreds of agents: Symphony, Minions, and the maturity model
High-level pipeline: Intake → Dispatch → Agents → Review → Merge → Monitor
OpenAI Symphony
Elixir-based orchestration, open-sourced March 2026
What It Is
OpenAI Symphony is an open-source orchestration framework released in March 2026, built in Elixir for its concurrency strengths. It polls project management tools like Linear for new issues, decomposes them into sub-tasks, and dispatches Codex agents to work on each sub-task in parallel. Results are assembled, reviewed, and merged automatically.
Architecture
// Symphony architecture
Intake:     Polls Linear for new issues
Planner:    Decomposes into sub-tasks
Dispatcher: Assigns to Codex agents
Workers:    Parallel agent execution
Assembler:  Combines results
Reviewer:   Automated quality check
Merger:     Creates and merges PRs
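The stages above can be sketched as a small pipeline. This is an illustrative Python sketch of the planner/dispatcher/worker/reviewer flow, not Symphony's actual Elixir API; `decompose`, `run_agent`, and `review` are hypothetical stubs standing in for real components.

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(issue):
    # Planner: split an issue into independent sub-tasks (stub).
    return [f"{issue}::subtask-{i}" for i in range(3)]

def run_agent(subtask):
    # Worker: a coding agent would act here; we fake a result.
    return {"task": subtask, "diff": f"patch for {subtask}"}

def review(result):
    # Reviewer: automated quality gate (stub always passes).
    return True

def pipeline(issue):
    subtasks = decompose(issue)                       # Planner
    with ThreadPoolExecutor() as pool:                # Dispatcher + Workers
        results = list(pool.map(run_agent, subtasks))
    approved = [r for r in results if review(r)]      # Reviewer
    return approved                                   # Assembler/Merger would open PRs

print(len(pipeline("LIN-123")))  # → 3
```

The key structural point survives even in a stub: the human-facing input is the issue; everything downstream is mechanical.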
Key insight: Symphony represents the shift from “developer uses AI tool” to “system dispatches AI agents.” The human writes the issue; the system handles everything else. The harness is the entire pipeline, not just a constraint file.
Stripe Minions
1,000+ merged PRs per week from internal agents
The System
Stripe’s Minions is an internal agent system that generates and merges over 1,000 PRs per week. Each “minion” is a specialized agent with its own harness: constraint documents, tool access, review pipeline, and merge criteria. The system handles everything from bug fixes to dependency updates to documentation improvements.
Scale Challenges
At 1,000+ PRs/week, traditional human review is impossible. Stripe uses tiered review: automated checks handle routine PRs (dependency updates, formatting), agent review handles medium-complexity PRs, and human review is reserved for high-risk changes (payment logic, security). The harness determines which tier each PR enters.
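A tiered router like this can be expressed as a simple decision function. The categories and risk paths below are illustrative, not Stripe's actual rules.

```python
def review_tier(pr):
    """Route a PR to a review tier based on risk signals.
    Paths and PR kinds are hypothetical examples."""
    high_risk_paths = ("payments/", "security/", "auth/")
    if any(f.startswith(high_risk_paths) for f in pr["files"]):
        return "human"            # high-risk changes: human review
    if pr["kind"] in ("dependency-update", "formatting"):
        return "automated"        # routine changes: automated checks only
    return "agent"                # medium complexity: agent review

print(review_tier({"files": ["payments/charge.py"], "kind": "bugfix"}))  # → human
```

Note the ordering: risk classification runs before routine classification, so a "formatting" change that touches payment code still gets a human.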
Why it matters: Stripe demonstrates that agent-generated code at scale is not theoretical — it’s production reality. The key enabler is not the model; it’s the orchestration and review harness that makes 1,000+ weekly PRs manageable.
Basis: Zero Human Code
45-person startup, $200M revenue, entirely agent-generated code
The Story
Basis is a 45-person startup generating $200M in annual revenue with a codebase that contains zero human-written code. Every line was generated by AI agents operating within a comprehensive harness. Engineers at Basis don’t write code — they design and maintain the harness that guides agents to write correct code.
What Makes It Work
Basis’s success depends on an extremely mature harness: comprehensive constraint documents, extensive structural tests, multi-layer review pipelines, episodic memory from thousands of past tasks, and continuous entropy management. The harness is the product — the code is just its output.
Key insight: Basis proves that “model is commodity, harness is moat” is not just a slogan. Their competitive advantage is entirely in the harness. Any competitor could use the same models, but they’d need to build an equivalent harness to match Basis’s output quality.
The Four Maturity Levels
From ad-hoc to autonomous
Maturity Model
Level 1: Ad-Hoc
- Developer uses AI as autocomplete
- No constraint documents
- No review pipeline
- Manual everything

Level 2: Guided
- CLAUDE.md / AGENTS.md in place
- Basic linting rules
- Human reviews all agent output
- Single-agent tasks

Level 3: Systematic
- 3-tier constraint architecture
- Multi-agent review pipelines
- Episodic memory and learning
- CI/CD integration

Level 4: Autonomous
- Full orchestration (Symphony-like)
- Self-improving harness
- Entropy management
- Human oversight, not review
Where Most Teams Are
Most teams in early 2026 are at Level 1 or 2. They use AI tools but haven’t invested in systematic harness engineering. The jump from Level 2 to Level 3 is the highest-impact investment — it’s where agent reliability goes from “sometimes useful” to “consistently productive.”
Key insight: You don’t need to reach Level 4 to get value. Level 3 (systematic constraints, review pipelines, memory) is where most of the ROI lives. Level 4 is for organizations where agents are the primary code producers.
OpenAI’s 1M-Line Experiment
5 months, 3 engineers, zero human-written code
The Experiment
OpenAI built an internal codebase of over 1 million lines over 5 months with just 3 engineers and zero human-written code. The engineers designed and maintained the harness; agents wrote all the code. This was both a proof of concept and a stress test of harness engineering at extreme scale.
Lessons Learned
Constraint documents evolve rapidly at this velocity — weekly updates were necessary.

Entropy management is critical — without cleanup agents, the codebase degraded within weeks.

Human oversight shifts from reviewing code to reviewing the harness itself.

The bottleneck is harness quality, not model capability.
Critical insight: At 1M lines with zero human code, every bug is a harness failure. The experiment proved that harness engineering is not just about improving agent output — it’s about making agent output trustworthy enough to deploy without human code review.
Multi-Agent Coordination
When agents need to work together
Coordination Patterns
Sequential: Agent A completes, passes output to Agent B. Simple but slow.

Parallel: Multiple agents work on independent sub-tasks simultaneously. Fast but requires task decomposition.

Hierarchical: A lead agent plans and delegates to worker agents. Good for complex tasks.

Collaborative: Agents share a workspace and coordinate through shared state. Most complex but most flexible.
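The first three patterns can be sketched in a few lines of asyncio. This is a schematic, assuming a `worker` coroutine stands in for a real agent call; the names are hypothetical.

```python
import asyncio

async def worker(name, subtask):
    await asyncio.sleep(0)                 # stand-in for real agent work
    return f"{name} finished {subtask}"

async def sequential(subtasks):
    # Agent A completes, passes control to agent B. Simple but slow.
    results = []
    for i, t in enumerate(subtasks):
        results.append(await worker(f"agent-{i}", t))
    return results

async def parallel(subtasks):
    # Independent sub-tasks run simultaneously.
    return await asyncio.gather(
        *(worker(f"agent-{i}", t) for i, t in enumerate(subtasks)))

async def hierarchical(task):
    # A lead agent plans (here: trivially splits), then delegates.
    subtasks = [f"{task}/part-{i}" for i in range(2)]
    return await parallel(subtasks)

print(asyncio.run(hierarchical("refactor")))
```

The collaborative pattern is deliberately omitted: shared mutable state between agents needs the locking and conflict machinery discussed next, not a ten-line sketch.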
Conflict Resolution
When parallel agents modify the same files, conflicts arise. Resolution strategies: file-level locking (only one agent per file), merge-based (agents work on branches, merge conflicts resolved automatically or by a coordinator agent), or task decomposition (ensure sub-tasks don’t overlap at the file level).
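File-level locking, the simplest of the three strategies, can be sketched as follows. This is illustrative; a production system would use leases with timeouts so a crashed agent cannot hold a file forever.

```python
import threading

class FileLocks:
    """File-level locking: at most one agent may edit a file at a time."""
    def __init__(self):
        self._guard = threading.Lock()
        self._owners = {}                  # file path -> agent id

    def acquire(self, agent, paths):
        # All-or-nothing: grant the set of files only if none are held.
        with self._guard:
            if any(p in self._owners for p in paths):
                return False               # conflict: another agent holds a file
            for p in paths:
                self._owners[p] = agent
            return True

    def release(self, agent):
        with self._guard:
            self._owners = {p: a for p, a in self._owners.items() if a != agent}

locks = FileLocks()
print(locks.acquire("agent-1", ["api.py"]))  # → True
print(locks.acquire("agent-2", ["api.py"]))  # → False, agent-1 holds it
```

The all-or-nothing acquire also prevents deadlock between two agents each holding half of the other's file set.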
Rule of thumb: Start with sequential orchestration. Move to parallel only when you have reliable task decomposition. Collaborative multi-agent is the most powerful but requires the most sophisticated harness.
Monitoring Agent Fleets
Observability for systems running dozens of agents
What to Monitor
Task completion rate: What percentage of dispatched tasks complete successfully?

Time to completion: How long do tasks take? Are some stuck?

Cost per task: Token consumption and API costs per completed task.

Error rate by type: Which failure modes are most common?

Review pass rate: What percentage pass automated review on first attempt?
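The metrics above reduce to simple aggregates over per-task records. The record schema here is an assumption for illustration, not a standard.

```python
from statistics import mean

def fleet_metrics(tasks):
    """Compute fleet-level metrics from per-task records (hypothetical schema)."""
    done = [t for t in tasks if t["status"] == "done"]
    return {
        "completion_rate": len(done) / len(tasks),
        "avg_seconds": mean(t["seconds"] for t in done),
        "avg_cost_usd": mean(t["cost_usd"] for t in done),
        # mean of booleans gives the first-pass review rate
        "first_pass_review_rate": mean(t["review_passed_first"] for t in done),
    }

tasks = [
    {"status": "done", "seconds": 120, "cost_usd": 0.40, "review_passed_first": True},
    {"status": "done", "seconds": 300, "cost_usd": 1.10, "review_passed_first": False},
    {"status": "failed", "seconds": 600, "cost_usd": 2.00, "review_passed_first": False},
]
print(round(fleet_metrics(tasks)["completion_rate"], 2))  # → 0.67
```

Error rate by type is the same idea with a `groupby` on a failure-mode field; it is omitted here to keep the sketch short.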
Alerting
Stuck agents: Alert when an agent hasn’t produced output in N minutes.

Cost spikes: Alert when a single task exceeds the cost budget (likely a doom loop).

Error rate spikes: Alert when the failure rate exceeds the baseline (likely a harness or model issue).

Merge conflicts: Alert when parallel agents create conflicting changes.
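Three of the four alert conditions fit a single predicate function. The thresholds below are illustrative defaults, not recommendations; merge-conflict alerting needs repository state and is left out.

```python
def alerts(agent, budget_usd=5.0, stall_minutes=15, baseline_error_rate=0.05):
    """Evaluate alert conditions for one agent/task snapshot (hypothetical schema)."""
    fired = []
    if agent["minutes_since_output"] > stall_minutes:
        fired.append("stuck-agent")
    if agent["cost_usd"] > budget_usd:
        fired.append("cost-spike")        # likely a doom loop
    if agent["error_rate"] > 2 * baseline_error_rate:
        fired.append("error-rate-spike")  # likely a harness or model issue
    return fired

print(alerts({"minutes_since_output": 30, "cost_usd": 7.2, "error_rate": 0.02}))
# → ['stuck-agent', 'cost-spike']
```

In practice each condition would page a different owner: a stuck agent is an operations issue, a cost spike is a harness issue, an error-rate spike may be an upstream model issue.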
Key insight: At scale, you’re not monitoring individual agents — you’re monitoring the fleet. The same observability principles that apply to microservices apply to agent fleets: health checks, metrics, alerting, and dashboards.
Building Toward Scale
Practical steps from single-agent to fleet
The Path
Phase 1: Single agent, single task. Master the harness fundamentals: constraints, linting, review.

Phase 2: Single agent, multiple tasks. Add episodic memory and task-specific skills.

Phase 3: Multiple agents, parallel tasks. Add orchestration, conflict resolution, fleet monitoring.

Phase 4: Autonomous pipeline. Add self-improving harness, entropy management, human oversight (not review).
When to Scale
Scale when your single-agent harness is reliable. If your single agent has a >70% first-pass approval rate and your review pipeline catches the rest, you’re ready to add more agents. If your single agent still needs heavy human correction, adding more agents just multiplies the problems.
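The >70% first-pass heuristic from the text is trivial to encode, which makes it easy to wire into a dashboard or a CI gate. A minimal sketch:

```python
def ready_to_scale(approved_first_pass, total_tasks, threshold=0.70):
    """The >70% first-pass approval heuristic described above."""
    return total_tasks > 0 and approved_first_pass / total_tasks > threshold

print(ready_to_scale(36, 50))  # 72% first-pass → True
print(ready_to_scale(30, 50))  # 60% first-pass → False
```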
Key insight: Orchestration at scale is not about running more agents. It’s about having a harness reliable enough that running more agents produces more value, not more problems. Master the harness first, then scale.