AI Harness Engineering: Why the Wrapper Matters More Than the Model

In February 2026, OpenAI published “Harness Engineering,” a short post that gave a name to something many of us had been doing without one. The framing was simple:

Agent = Model + Harness

The model is the LLM doing the reasoning. The harness is everything else — tool management, retries, context windowing, sub-agent control, approval gates, the architectural constraints that keep an agent from drifting off the rails.
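The split is easiest to see in code. Below is a minimal sketch of a harness loop in Python; every name in it (`run_agent`, the `model.next_action` interface, the tool registry, the `approve` callback) is illustrative scaffolding under assumed interfaces, not any particular framework's API:

```python
import time

MAX_STEPS = 50
MAX_RETRIES = 3
WINDOW = 40                                 # keep roughly the last N messages
APPROVAL_REQUIRED = {"shell", "deploy"}     # gate risky tools behind a human

def run_agent(model, tools, task, approve):
    """Drive one task to completion. `model.next_action(history)` is assumed
    to return an object with .kind ('tool' or 'finish'), .tool_name, .args,
    and .result; an illustrative interface, not any real framework's."""
    history = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        action = model.next_action(history)             # the model reasons...
        if action.kind == "finish":
            return action.result
        if action.tool_name in APPROVAL_REQUIRED and not approve(action):
            history.append({"role": "harness", "content": "denied by approval gate"})
            continue
        observation = None
        for attempt in range(MAX_RETRIES):              # ...the harness retries,
            try:
                observation = tools[action.tool_name](**action.args)
                break
            except Exception as exc:
                observation = f"tool error (attempt {attempt + 1}): {exc}"
                time.sleep(2 ** attempt)
        history.append({"role": "tool", "content": str(observation)})
        if len(history) > WINDOW:                       # ...and manages the window
            history = history[:1] + history[-(WINDOW - 1):]
    raise RuntimeError("agent exceeded step budget without finishing")
```

Everything after the `model.next_action` call is harness territory, and that is where the variance between teams lives.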

I have been building agentic systems in production for the last eighteen months. The pattern I keep seeing is that two teams using the same model will get wildly different results, and the difference is almost always the harness. This site is where I will write about that gap.

The argument in three pieces of evidence

LangChain ran an experiment where they held the model fixed (GPT-5.2-Codex) and changed only the surrounding scaffolding. Their Terminal Bench 2.0 score rose from 52.8% to 66.5%, a 13.7-point jump from harness changes alone. They were not picking a smarter model. They were giving the same model better tools, better feedback loops, and a pre-completion checklist that intercepted premature “I’m done” signals.
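LangChain did not publish the checklist itself, so the sketch below is only the shape of the idea: a “finish” action terminates the loop only if every check passes, and any failure is turned into feedback instead of an exit. The `workspace` API and both checks are invented placeholders:

```python
# Sketch of a pre-completion gate, assuming a hypothetical `workspace.run()`
# that returns an object with .returncode and .output. Check names are
# illustrative, not LangChain's actual implementation.

def tests_pass(workspace):
    """Return None if the suite passes, else a failure summary."""
    result = workspace.run("pytest -q")
    return None if result.returncode == 0 else result.output[-2000:]

def diff_nonempty(workspace):
    """Catch the agent claiming completion without having changed anything."""
    out = workspace.run("git diff --stat").output
    if out.strip():
        return None
    return "You reported completion, but the working tree has no changes."

CHECKLIST = [tests_pass, diff_nonempty]

def handle_finish(action, history, workspace):
    failures = [msg for check in CHECKLIST if (msg := check(workspace))]
    if not failures:
        return action.result                  # genuinely done
    history.append({
        "role": "harness",
        "content": "Not done yet. Fix these before finishing:\n" + "\n".join(failures),
    })
    return None                               # keep the loop running
```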

Manus, an agentic system that hit $100M ARR in eight months, was acquired by Meta in late 2025 for around two billion dollars. The team has said publicly they rewrote the harness five times in six months without changing the underlying model, and reliability climbed each time. Meta did not buy a model. They bought a wrapper.

OpenAI’s own internal Codex effort had three engineers ship roughly a million lines of production code across more than fifteen hundred pull requests over five months, with no hand-written code. Their post-mortem credits the harness (context engineering, architectural constraints, an entropy-management agent that prunes drift), not the model.
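The post-mortem does not spell out how that entropy-management agent works, so this is only a guess at the shape: once the transcript grows past a budget, a second model call compresses everything but the recent tail into a brief, discarding dead ends. The `summarizer` callable is assumed:

```python
# Guessed shape of an entropy-management pass; the summarizer interface is
# an assumption, not taken from OpenAI's post.

PRUNE_THRESHOLD = 80      # messages before we compress
KEEP_TAIL = 20            # recent messages kept verbatim

SUMMARY_PROMPT = (
    "Summarize this agent transcript for a colleague taking over the task. "
    "Keep decisions, open problems, and file paths. Drop dead ends."
)

def prune_context(history, summarizer):
    if len(history) <= PRUNE_THRESHOLD:
        return history
    head, tail = history[:-KEEP_TAIL], history[-KEEP_TAIL:]
    brief = summarizer(SUMMARY_PROMPT, head)       # second-model call
    return [{"role": "harness", "content": f"Context brief: {brief}"}] + tail
```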

Three independent data points. Same conclusion: the model is necessary, but the harness is what separates a demo from a production system.

What this site will cover

The plan is to write in two cadences:

Long-form posts (monthly, on /posts). Synthesis pieces. How a specific harness primitive works, why it works, where it breaks, what to use instead. The first ones I want to write are on Claude Code’s hook architecture, the dual-agent pattern Anthropic published in late 2025, and the failure modes I have run into when sub-agents are misused.

TIL (a few times a week, on /til). Short notes. A bug I hit, an API behavior I did not expect, a config that turned out to matter. Modeled on Simon Willison’s /til/, because the lab notebook is more useful when it is public.

If you care about getting agents to actually work in production — not just to demo well — this is for you.

What this site will not cover

Prompt engineering as a standalone topic. Chat UI design. Generic LLM benchmarks. AI hot takes. There is plenty of writing on those elsewhere; the bar for me adding to that pile is high.

I will probably get some of this wrong. The space is moving fast and parts of what I write here will look dated in a year. The bet is that the underlying engineering questions — how do you keep a long-running stochastic process on track, how do you observe it, how do you recover when it drifts — will keep mattering for longer than any specific model generation.