Ch 2 — Background Coding Agents

Fire-and-forget task submission, sandboxed execution, and async PR delivery
High Level
Task → Sandbox → Code → Test → Self-Review → PR
The Core Pattern
Submit a task, walk away, come back to a pull request
How Background Agents Work
Every background coding agent follows the same fundamental loop:
1. Accept a task via CLI, chat, or API.
2. Spin up an isolated environment: a cloud sandbox, a container, or a VM.
3. Clone the repo, make changes, run tests.
4. Self-review and iterate until the task passes quality checks.
5. Open a PR for human review.
The entire cycle happens without your IDE open. You submit the task and come back to a finished PR.
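The loop can be sketched as a small driver. Every name here (run_background_task, the fake tests_pass check) is illustrative, not a real agent API:

```python
# Minimal sketch of the background-agent loop, assuming hypothetical names.
def run_background_task(task: str, max_iterations: int = 5) -> dict:
    """Accept a task, iterate in an isolated sandbox, return a PR-like result."""
    sandbox = {"repo": "cloned", "changes": []}          # (2) isolated environment
    for attempt in range(1, max_iterations + 1):
        sandbox["changes"].append(f"edit for: {task}")   # (3) make changes
        tests_pass = attempt >= 2                        # stand-in for a real test run
        if tests_pass:                                   # (4) self-review gate passed
            return {"status": "pr_opened", "task": task, "attempts": attempt}
    return {"status": "needs_human", "task": task, "attempts": max_iterations}

result = run_background_task("Add pagination to /users")
print(result["status"], result["attempts"])
```

The point of the sketch is the shape: a bounded repair loop with an explicit quality gate, ending in either a PR or an escalation to a human.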
Why “Background” Matters
The word “background” is the key differentiator. Interactive agents (Level 2–3) run in your session — they stop when you stop. Background agents run independently. This decoupling is what enables parallelism: you can have multiple agents working on different tasks while you focus on architecture, reviews, or entirely different work.
Key insight: Background agents shift the developer’s role from “writing code” to “writing task descriptions and reviewing PRs.” The quality of your task description directly determines the quality of the output.
OpenAI Codex
Cloud sandboxes with GitHub integration
Architecture
Codex runs agents in cloud sandboxes — isolated environments with their own filesystem, terminal, and network access. Each task gets a fresh clone of your repo. The agent follows an iterative loop: plan, edit code, run tools (tests/build/lint), observe results, repair failures, and repeat. The Codex CLI supports non-interactive execution via the exec command, enabling automation and CI integration.
Distinctive Features
Conversation resumption — you can resume a prior session with new instructions, and the agent retains the original transcript, plan history, and approvals. GitHub Action integration — Codex can be triggered from CI workflows. Course corrections mid-flight — you can redirect the agent without resetting its progress.
CLI example: non-interactive execution
# Submit a task and let it run
codex exec "Add pagination to the /users endpoint, update tests"

# Resume a previous session with new instructions
codex exec resume --last "Fix the race conditions you found"
Devin 2.2
Full desktop environment with self-reviewing PRs
Architecture
Devin runs in a full Linux VM with desktop access — not just a terminal sandbox. It can launch desktop applications, interact with browsers, and test UIs visually. Devin 2.2 (released Feb 2026) introduced self-verification: the agent plans, codes, reviews its own output, catches issues, and fixes them before the PR is opened. It handles the full development loop independently, with 3x faster startup than previous versions.
Managed Devins
Devin can break down large tasks and delegate to multiple managed sessions working in parallel, each in its own isolated VM. A coordinator session monitors progress, resolves conflicts, and compiles results. This is multi-agent orchestration built into the product — you submit one task, and Devin decides how to parallelize it.
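The coordinator pattern described above can be approximated in plain Python, using threads as stand-ins for isolated VM sessions. The subtask split and the run_session function are illustrative; Devin's actual orchestration is internal to the product:

```python
# Hedged sketch: a coordinator fans a task out to parallel sessions
# and compiles the results. Threads stand in for isolated VMs.
from concurrent.futures import ThreadPoolExecutor

def run_session(subtask: str) -> str:
    # Stand-in for an isolated session doing real work.
    return f"done: {subtask}"

def coordinator(task: str, subtasks: list[str]) -> dict:
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_session, subtasks))  # parallel execution
    return {"task": task, "sessions": len(subtasks), "results": results}

report = coordinator("Migrate API v1 to v2",
                     ["update models", "update handlers", "update tests"])
print(report["sessions"])  # → 3
```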
Key insight: Devin’s desktop access means it can test things other agents can’t — visual layouts, desktop app behavior, browser interactions. The trade-off is heavier resource usage per session.
Claude Code Subagents
Parallel execution with task delegation
Architecture
Claude Code takes a different approach: instead of a separate cloud product, it extends the CLI with subagent spawning. You can move any task to the background with Ctrl+B and continue working in your main session. Multiple subagents run concurrently when tasks are independent — for example, running style-checking, security scanning, and test coverage analysis in parallel. Monitor all running agents with /tasks.
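The "background within your local workflow" idea maps onto a familiar pattern: launch independent checks, keep working, then poll their status like a /tasks view. The check names and results below are made up; Claude Code's real subagent mechanics are internal to the tool:

```python
# Sketch of backgrounding independent checks while the main session continues.
from concurrent.futures import ThreadPoolExecutor

checks = {"style": lambda: "0 violations",      # illustrative check results
          "security": lambda: "no findings",
          "coverage": lambda: "87%"}

pool = ThreadPoolExecutor()
futures = {name: pool.submit(fn) for name, fn in checks.items()}

# ...main session keeps working here while checks run concurrently...

for name, fut in futures.items():               # a /tasks-style status view
    print(f"{name}: {fut.result()}")
pool.shutdown()
```

Parallelism is safe here precisely because the checks are independent, which is the same condition Claude Code uses to decide when subagents can run concurrently.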
Routing Rules
You teach Claude when to use parallel vs sequential execution by adding routing rules to your CLAUDE.md file. The main agent reads these rules and makes automatic delegation decisions based on task dependencies and file boundaries. Subagents can be defined as markdown files in .claude/agents/ directories, giving you reusable, specialized agents for common workflows.
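CLAUDE.md is free-form instruction text, so there is no fixed syntax for routing rules; a hypothetical fragment (wording and file paths invented for illustration) might look like:

```markdown
## Agent routing rules (hypothetical example)
- Run style checks, security scans, and coverage analysis as parallel
  subagents; they touch disjoint files and have no ordering dependency.
- Run schema migrations and the code that depends on them sequentially.
- Delegate any task confined to tests/ to the test-writer subagent
  defined in .claude/agents/test-writer.md.
```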
Key insight: Claude Code’s approach is “background within your local workflow” rather than a separate cloud service. This makes it easier to adopt incrementally — you start with one background task and scale up.
What to Delegate vs. What to Keep
The delegation decision framework
Good Background Tasks
Bug fixes with clear reproduction steps — the agent can run the repro, verify the fix, and confirm tests pass. Boilerplate and scaffolding — new CRUD endpoints, form components, migration files. Dependency updates — bump versions, fix breaking changes, run the test suite. Well-scoped features with clear acceptance criteria — “add a dark mode toggle that persists to localStorage.”
Keep Interactive
Architecture decisions — the agent can propose, but you need to evaluate trade-offs. Ambiguous requirements — if you can’t write a clear task description, the agent can’t execute it. Security-sensitive changes — auth flows, encryption, access control need human judgment. Cross-cutting refactors that touch many systems — the blast radius is too high for unsupervised work.
Key insight: The rule of thumb is simple: if you could hand this task to a competent junior developer with a clear written spec and expect a good PR back, it’s a good background agent task. If you’d need to pair with them, keep it interactive.
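That rule of thumb can be encoded as a toy checklist. The attributes and the file-count threshold here are illustrative, not a formal rubric:

```python
# Toy encoding of the delegation framework; thresholds are illustrative.
def good_background_task(task: dict) -> bool:
    """True if the task reads like a clear spec a junior dev could execute."""
    return (task["has_clear_spec"]
            and not task["security_sensitive"]
            and not task["architecture_decision"]
            and task["files_touched"] <= 10)     # keep the blast radius small

dep_bump = {"has_clear_spec": True, "security_sensitive": False,
            "architecture_decision": False, "files_touched": 3}
auth_change = {"has_clear_spec": True, "security_sensitive": True,
               "architecture_decision": False, "files_touched": 2}
print(good_background_task(dep_bump), good_background_task(auth_change))  # → True False
```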
The Blueprint Pattern
Combining deterministic scaffolding with flexible agent loops
The Concept
Stripe’s Minions system introduced the blueprint pattern: workflows defined in code that specify how tasks are divided into subtasks, with some handled deterministically (linting, formatting, boilerplate generation) and others handled by the agent (logic implementation, test writing). The blueprint acts as a recipe — it knows the steps, but the agent fills in the creative parts. This is how Stripe’s Minions produce over 1,300 PRs per week.
Why It Works
Blueprints reduce the agent’s decision space. Instead of figuring out the entire workflow from scratch, the agent only needs to handle the parts that require reasoning. The deterministic steps (clone repo, create branch, run linter, format code, run CI) are reliable and fast. The agent steps (understand the task, write the implementation, handle edge cases) get the full power of the LLM. This hybrid approach is more reliable than pure agent execution.
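A minimal sketch of a blueprint as an ordered list of steps, where deterministic functions and agent calls share one interface. The step names and the agent stub are invented for illustration; Stripe's Minions internals are not public:

```python
# Blueprint sketch: deterministic steps and agent steps in one pipeline.
def lint(ctx):
    ctx["linted"] = True            # deterministic: always the same behavior
    return ctx

def fmt(ctx):
    ctx["formatted"] = True         # deterministic
    return ctx

def agent_step(prompt):
    def step(ctx):                  # stand-in for an LLM call with this prompt
        ctx.setdefault("agent_work", []).append(prompt)
        return ctx
    return step

BLUEPRINT = [
    lint,                                     # deterministic
    agent_step("implement the change"),       # requires reasoning
    agent_step("write tests for the change"), # requires reasoning
    fmt,                                      # deterministic
]

def run_blueprint(task):
    ctx = {"task": task}
    for step in BLUEPRINT:
        ctx = step(ctx)
    return ctx

result = run_blueprint("add rate limiting")
print(len(result["agent_work"]))  # → 2
```

The design choice worth noting: the blueprint owns the control flow, so the agent never has to decide what order the workflow runs in, only what goes inside its own steps.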
Key insight: The blueprint pattern is the most important architectural concept in this course. It’s the bridge between “one agent doing one task” and “an automated pipeline handling many tasks.” We’ll revisit it in Chapter 6.
Current Limitations
Where background agents struggle today
Task Complexity Ceiling
Background agents work best on tasks that take a human 30 minutes to 2 hours. Below that, the overhead of writing a good task description isn’t worth it. Above that, the agent’s context window fills up, it loses track of the plan, and quality degrades. The sweet spot is well-scoped, single-PR tasks with clear success criteria.
The Review Bottleneck
Background agents can produce PRs faster than humans can review them. If you have 5 agents producing 5 PRs per hour, but your team can only review 3 PRs per hour, you’ve just moved the bottleneck from code writing to code review. The review process must scale with agent output — through better PR descriptions, automated pre-review checks, and structured review workflows.
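The arithmetic is worth making explicit. A one-line model with the rates from the text shows how quickly the unreviewed queue grows:

```python
# Back-of-the-envelope model: production rate minus review rate, times hours.
def queue_after(hours, produced_per_hr=5, reviewed_per_hr=3):
    return max(0, (produced_per_hr - reviewed_per_hr) * hours)

print(queue_after(8))  # → 16 unreviewed PRs after one workday
```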
Key insight: The biggest mistake teams make is scaling agent output without scaling review capacity. More agents without more review bandwidth just creates a PR queue that nobody looks at.
Getting Started
The practical ramp-up path
Week 1–2: Single Agent, Low Risk
Pick one tool (Codex, Devin, or Claude Code subagents). Pick one low-risk task type — dependency updates or adding missing test coverage are ideal. Submit 3–5 tasks. Review every PR carefully. Measure: how long did the agent take? How much review effort was needed? What percentage of PRs were mergeable without significant changes?
Week 3–4: Expand Scope
If your first batch went well, expand to bug fixes and small features. Start running 2–3 agents in parallel. Develop a template for task descriptions that consistently produces good results. The template should include: what to change, acceptance criteria, which tests to run, and any constraints (don’t touch these files, follow this pattern).
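A hypothetical task-description template covering those four elements; the headings and wording are one possible shape, to be adapted to your tooling:

```markdown
## Task
Add a dark mode toggle to the settings page.

## Acceptance criteria
- Toggle persists the preference to localStorage.
- All existing settings tests still pass.

## Tests to run
- npm test -- settings

## Constraints
- Don't touch files outside src/settings/.
- Follow the existing ThemeProvider pattern.
```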
Key insight: Treat your first month with background agents as an experiment, not a commitment. The goal is to learn what works for your codebase, your team, and your review process — not to maximize throughput on day one.