Ch 1 — What Is Harness Engineering?

The discipline of building systems that make AI coding agents reliable and productive
High Level: Model → Constrain → Inform → Verify → Correct → Output
The Horse-Tack Metaphor
Why powerful models need harnesses
The Metaphor
A horse is powerful but unpredictable. A harness doesn’t make the horse stronger — it channels the horse’s power productively. Reins provide direction. Blinders prevent distraction. The bit provides fine-grained control. Without tack, even the best horse is unreliable. The same is true for AI coding agents.
The AI Parallel
LLMs are powerful but unpredictable. They can write brilliant code or introduce subtle bugs. They can follow architectural patterns or ignore them. Harness engineering is the discipline of building the systems that channel this power productively — constraint documents, linters, review pipelines, memory systems, and orchestration layers that make agents reliable.
Key insight: The harness doesn’t replace the model’s intelligence. It creates the environment where that intelligence is applied consistently, safely, and in alignment with the team’s standards.
Origin & Naming
How the discipline emerged in early 2026
Elvis Saravia’s Coining
The term “harness engineering” was coined by Elvis Saravia in early 2026, building on the horse-tack analogy. Saravia defined it as the software engineering discipline focused on building systems and infrastructure to control AI coding agents. The term quickly gained traction because it captured what practitioners were already doing but lacked a name for.
Martin Fowler’s Description
Martin Fowler described harness engineering as the practice of building “the surrounding system that makes AI agents useful in production.” His endorsement signaled that this wasn’t just a trend — it was becoming a recognized engineering discipline with its own principles, patterns, and best practices.
Why it matters: Before “harness engineering” had a name, teams were independently inventing the same patterns: constraint files, review loops, linting rules. Naming the discipline accelerated knowledge sharing and standardization.
The Four Aspects
Constrain, Inform, Verify, Correct
Formal Definition
Harness engineering has four formal aspects, each addressing a different failure mode of AI agents:

Constrain: Limit what the agent can do. Architectural rules, dependency boundaries, forbidden patterns, file access restrictions.

Inform: Give the agent the right context. Constraint documents, skill files, codebase conventions, few-shot examples.
 
Verify: Check the agent’s output before it ships. Linters, type checkers, tests, multi-agent review, human review gates.

Correct: Fix problems automatically when possible. Self-healing loops, auto-formatting, error retry with feedback, entropy cleanup agents.
Key insight: Most teams start with Inform (writing CLAUDE.md) and skip the other three. A complete harness addresses all four aspects. The order matters: constrain first (prevent bad outputs), then inform (guide good outputs), then verify (catch what slipped through), then correct (fix automatically).
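The four aspects can be sketched as a single agent loop. This is a minimal illustration under assumed interfaces, not any vendor's API: `generate_patch` stands in for the model call, and the forbidden-pattern list and `run_checks` are hypothetical stand-ins for real linters and tests.

```python
import re

FORBIDDEN = [r"\beval\(", r"import \*"]  # Constrain: banned patterns

def build_context(task: str, conventions: str) -> str:
    """Inform: give the agent the task plus codebase conventions."""
    return f"{conventions}\n\nTask: {task}"

def violates_constraints(code: str) -> list[str]:
    """Constrain: reject output matching a forbidden pattern."""
    return [p for p in FORBIDDEN if re.search(p, code)]

def run_checks(code: str) -> list[str]:
    """Verify: stand-in for linters, type checkers, and tests."""
    errors = []
    if "TODO" in code:
        errors.append("unresolved TODO")
    return errors

def harness_loop(task, conventions, generate_patch, max_retries=3):
    """Correct: feed failures back to the agent and retry."""
    feedback = ""
    for _ in range(max_retries):
        code = generate_patch(build_context(task, conventions) + feedback)
        problems = violates_constraints(code) + run_checks(code)
        if not problems:
            return code  # passed every gate
        feedback = "\nFix these issues: " + "; ".join(problems)
    raise RuntimeError("agent failed verification after retries")
```

Note the ordering matches the text: constraints and context are applied before generation is accepted, verification catches what slipped through, and correction retries with the failure list appended to the prompt.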
What a Harness Contains
The concrete components of a production harness
Core Components
A production AI agent harness contains:

Constraint Documents: CLAUDE.md / AGENTS.md / .cursorrules, task-specific skill files, deep reference guides.

Enforcement Layer: custom linting rules, architectural boundary tests, pre-commit hooks, type checking.

Review Pipeline: multi-agent review workflows, self-verification loops, human review gates, CI/CD integration.

Memory & Learning: episodic memory (few-shot examples), error pattern tracking, documentation consistency agents.

Orchestration: task routing and dispatch, agent coordination, progress monitoring.
The Spectrum
Harnesses range from minimal (a single CLAUDE.md file) to comprehensive (full orchestration with dozens of agents, custom linters, and automated review). Most teams start minimal and add components as they encounter specific failure modes. The key is to add each component in response to a real problem, not speculatively.
Critical in AI: A harness is not a one-time setup. It’s a living system that evolves as you discover new failure modes. The best harnesses are built iteratively: deploy the agent, observe failures, add constraints, repeat.
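As one concrete example of the enforcement layer, an architectural boundary test can be a few lines of Python using the standard `ast` module: parse a source file and reject imports that cross a forbidden boundary. The layer names (`ui`, `db`) and the rule itself are hypothetical.

```python
import ast

# Hypothetical boundary rule: code in the "ui" layer may not import "db".
FORBIDDEN_IMPORTS = {"ui": {"db"}}

def boundary_violations(source: str, layer: str) -> list[str]:
    """Return imported module names this layer is not allowed to use."""
    banned = FORBIDDEN_IMPORTS.get(layer, set())
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        # Compare on the top-level package so "db.models" is caught by "db".
        violations += [n for n in names if n.split(".")[0] in banned]
    return violations
```

Wired into a pre-commit hook or CI step, a check like this turns an architectural convention the agent might ignore into a hard gate it cannot pass without satisfying.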
Model Is Commodity, Harness Is Moat
Why the surrounding system matters more than the model
The Argument
Models are converging in capability. GPT-4, Claude, Gemini, and open-source alternatives all perform within a narrow band on most coding benchmarks. The differentiator is not which model you use — it’s the system you build around it. Two teams using the same model will get dramatically different results based on their harness quality.
The Evidence
LangChain improved from 52.8% to 66.5% on Terminal Bench 2.0 by changing only the harness, not the model. Nate B Jones demonstrated 78% vs 42% on the same benchmark with the same model by improving the harness. Google DeepMind’s AutoHarness paper showed small models with good harnesses outperforming larger models without them.
Key insight: If you’re spending time evaluating which model to use, you’re optimizing the wrong variable. The harness has 2–3× more impact on agent performance than the model choice. Invest in the harness first.
Harness vs. Context Engineering
Complementary disciplines, not competing ones
The Relationship
Context engineering controls what the model sees — the information in its context window. Harness engineering controls the entire environment the agent operates in — constraints, verification, correction, and orchestration. Context engineering is a subset of harness engineering. A harness includes context management plus everything else.
Scope Comparison
Context Engineering: system prompts and tool schemas, RAG and retrieval, compression and routing, token budgeting.

Harness Engineering (superset): all of context engineering, plus constraint documents, linting and architectural rules, review pipelines, memory and learning, orchestration at scale, governance and security.
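The subset relationship can be made concrete as a configuration sketch. The field names below are illustrative, not any framework's API: the harness configuration embeds a context configuration and adds the aspects beyond it.

```python
from dataclasses import dataclass, field

@dataclass
class ContextConfig:
    """What the model sees: context engineering's scope."""
    system_prompt: str = ""
    retrieval_sources: list[str] = field(default_factory=list)
    token_budget: int = 8000

@dataclass
class HarnessConfig:
    """The full operating environment: context management plus everything else."""
    context: ContextConfig = field(default_factory=ContextConfig)
    constraint_files: list[str] = field(default_factory=lambda: ["CLAUDE.md"])
    lint_commands: list[str] = field(default_factory=list)
    review_gates: list[str] = field(default_factory=lambda: ["human"])
    max_correction_retries: int = 3
```

Everything context engineering tunes lives inside the `context` field; the remaining fields are the harness-only surface.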
The New Engineering Role
What harness engineers actually do
The Shift
Harness engineering represents a fundamental shift in what software engineers do. Instead of writing application code directly, harness engineers design the systems that guide AI agents to write correct code. The skill set shifts from “how to implement feature X” to “how to constrain and guide an agent to implement feature X correctly.”
Key Skills
Constraint design: Writing effective CLAUDE.md files and architectural rules.

Failure mode analysis: Identifying how agents fail and designing guardrails.

Pipeline design: Building review and verification workflows.

Observability: Monitoring agent behavior and detecting drift.

Iterative improvement: Continuously refining the harness based on observed failures.
Key insight: The best harness engineers are experienced software engineers who understand both the codebase deeply and the failure modes of AI agents. They don’t replace traditional engineering skills — they build on them.
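To make "constraint design" concrete, here is a sketch of what a CLAUDE.md constraint document might contain. The paths and rules are illustrative, not taken from any real project:

```markdown
# CLAUDE.md — example constraint document (illustrative)

## Architecture
- All database access goes through `src/repositories/`; never query from handlers.
- New modules must not import from `legacy/`.

## Style
- Use the existing error types in `src/errors.py`; do not define ad-hoc exceptions.

## Process
- Run `make lint test` before declaring a task complete.
- If a test fails twice with the same error, stop and report instead of retrying.
```

Note how each rule is a checkable statement about this codebase rather than generic advice; that specificity is what separates an effective constraint file from boilerplate.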
The State of the Art (2026)
Where the industry stands today
Adoption
By early 2026, harness engineering has moved from experimental to mainstream. OpenAI built an internal 1M-line codebase over 5 months with 3 engineers and zero human-written code — entirely through agent harnesses. Stripe’s Minions system merges 1,000+ PRs per week. Basis, a 45-person startup, generates $200M in revenue with zero human-written code.
Industry Convergence
The major platforms have converged on similar harness patterns: Anthropic (CLAUDE.md), Cursor (.cursorrules, AGENTS.md), OpenAI (Codex with constraint files), Google (Jules with project rules). The specific file names differ, but the underlying pattern — structured constraint documents that guide agent behavior — is universal.
Key insight: Harness engineering is not a trend — it’s the inevitable consequence of AI agents becoming capable enough to write production code. Every team using AI agents is doing harness engineering, whether they call it that or not. The question is whether they’re doing it well.