writing · 2026-05-28

The Agent Is the Harness

#agents #harnesses #software

A model takes text in and produces text out. That is the whole contract. It holds no state between calls, runs no code, sees nothing past its knowledge cutoff, and forgets the conversation the moment the window closes. None of the behavior we mean by "agent" lives in the weights. It lives in the code wrapped around them: the loop that tracks messages, the tools that let it act, the files that survive a reset, the rules that decide when it stops. That wrapper is the harness, and it decides whether the work succeeds.

LangChain's Vivek Trivedy states it cleanly: agent equals model plus harness. If you are not the model, you are the harness. System prompts, tool schemas, sandbox configuration, compaction logic, the while loop itself, all of it is harness code, designed on purpose or not. The simplest chatbot proves the point. The moment you keep a list of prior messages and append the next one, you have built a primitive harness. The only question is whether you build it deliberately.

Model quality stops being the lever

Choosing a model used to absorb most of the attention. Benchmarks, latency, cost per token, this provider against that one. That work pays off until the model is good enough, then the wins move elsewhere. LangChain reported taking its own coding agent from outside the top thirty to the top five on Terminal-Bench 2.0 by changing only the harness, same weights underneath. The leaderboard itself makes the weaker version of the point: it ranks model-plus-agent combinations, not bare models, and one model swings ten to twenty points depending on the scaffold around it.

This holds because the failures that kill long-running agent work are structural, not a shortfall of intelligence. An agent does not botch a four-hour refactor because it cannot reason. It botches it because the context window filled around step forty and the model started losing the thread, a degradation Chroma's researchers named context rot. Practitioners running these agents all day report the same recurring failure shapes: context anxiety, where the agent grows desperate to end a long session and declares victory early; plan drift, where it quietly does a near-neighbor of the task and cascades from there; verification laziness, where it writes a weak test for the wrong thing and watches it pass. A framework executes each tool call correctly and still permits every one of these. They are harness decisions, and a smarter model does not make them for you.

There is direct evidence the harness matters as much as the weights. Claude Code and Codex are post-trained with their harnesses in the loop, not as a bare model handed a generic wrapper afterward. Change the tool logic, and performance drops even though the model is identical. The intelligence and the scaffolding co-evolved. You cannot cleanly separate them.

The primitives that carry the work

Working backward from what we want an agent to do, the same components keep appearing.

The filesystem is the foundational one. It gives the agent a workspace, a place to offload anything that does not fit in context, and a record that outlasts a single session. It is also the collaboration surface: two agents, or an agent and a person, coordinate through shared files instead of a shared context window. Git layers versioning on top, so work can be branched and rolled back, and a fresh agent can read the history to learn what already happened.

Bash and code execution turn a fixed tool list into an open one. Pre-built tools cover the cases the designer anticipated; a shell covers the rest, because the agent writes the missing tool on the fly. Sandboxes decide where that execution lands. Running model-generated code on your own machine is a security hole and a scaling dead end, so the harness routes it into an isolated environment with allow-listed commands, pre-installed runtimes, a browser for checking output, and a test runner.

Memory and search supply the knowledge the weights lack. A standing memory file injected at startup carries lessons across sessions, a thin form of continual learning. Web search and tools like Context7, Upstash's up-to-date library-docs server, reach past the knowledge cutoff for current APIs and live data.

Context management fights the rot. As the window fills and the model degrades, the harness compacts: it summarizes and offloads before the API errors or the earliest messages silently drop. It truncates noisy tool output to head and tail and keeps the full result on disk for when the model wants it. It loads a skill's detailed instructions only when a task calls for them, instead of paying the context tax up front.

Orchestration and verification turn the loop into a workflow. Subagents split the work into planner, generator, and reviewer roles. Stop conditions read actual conversation state rather than a disconnected step counter. A fresh-context reviewer catches what the primary worker misses, because the worker has degraded under accumulated pressure while the reviewer reads clean. Verification needs an explicit contract and often its own agent, or the work optimizes for whatever test passes cheapest.

Ralph, concretely

Ralph is my own coding agent, and the name is a homage. Geoffrey Huntley named the original pattern in 2025: the Ralph loop, while :; do cat PROMPT.md | claude-code; done. Run the agent in a bare loop, hand it fresh context every pass, let it resume from state on disk. When models were weaker and lost the thread halfway through a task, throwing away the window each iteration was a breakthrough. It traded continuity for reliability, and the trade paid.

Better models changed the math. A current model holds a long task well enough that you no longer wipe its memory to keep it coherent. So Ralph stopped being a loop and became an agent with deliberate context management, built on LangChain's open-swe and its LangGraph and Deep Agents foundation. It works the task directly: understand, delegate, verify, submit. A few additions are mine.

Compaction is where the old fresh-context trick lives on in a sharper form. Rather than wipe the window, the harness caps the model's input below its hard limit so summarization fires early, condenses the older turns, offloads bulky tool output to disk, and keeps the most recent turns whole. It also turns off the framework's built-in summarizer, so two compactors are not fighting over the same context.

An advisor tool gives the agent a second opinion on hard calls. When it hits a decision it is unsure about, it asks the same model again in a clean context, free of the clutter that piled up while it worked. A fresh read catches what the cluttered one talked itself into.

Ralph can also orchestrate Claude Code. The honest reason: I want to build the best harness I can, and I do not always have time to perfect it. Anthropic does. So a smaller, cheaper model runs the loop and hands the heavy implementation to Claude Code inside the sandbox, acting as the manager that steers and checks instead of typing every line. It does not trust the report it gets back. It diffs the repo to see what changed and runs a lint gate that blocks any newly introduced error. A small agent managing a larger one, with verification it controls. Building my own harness taught me when to lean on someone else's. That is a harness decision too.

The sandbox runs and tests web servers. It boots a dev server, waits until the server answers instead of assuming it came up, and drives a headless browser to confirm the frontend and endpoints respond. That is the gap between "the code compiles" and "the thing works."

Will the models absorb it?

The obvious objection is Rich Sutton's bitter lesson: general methods that ride compute beat hand-built structure every time, so today's harness scaffolding is tomorrow's obsolete crutch. Models will plan better, verify better, and hold coherence longer. Some of what the harness does now will get swallowed by the weights. Ralph is my own example. The fresh-context loop it was named for is already dead weight on a current model, which holds a task long enough that wiping the window each pass only loses ground.

The harness did not vanish when that trick stopped being needed. It moved up a level, into compaction, a second-opinion tool, and delegation. That is the pattern I expect to repeat. Even a model that plans and verifies perfectly still has no filesystem, no sandbox, no git history, and no permission to run a command on your machine. Those are environment, not cognition, and someone has to build and secure them. The trend so far runs that way: memory, durable state, and tool access are migrating up into the harness, not down into the model. Sarah Wooders puts the memory case sharply, that asking to plug memory into an agent is like asking to plug driving into a car. Prompt engineering still earns its keep years after people called its death, and harness engineering sits one layer down at the same load-bearing spot.

The title is a provocation, and I mean it as one. The model supplies the intelligence. Whether that intelligence becomes shipped, correct, long-running work is settled by the code wrapped around it. Build the wrapper on purpose.

References

Vivek Trivedy, "The Anatomy of an Agent Harness", LangChain, 2026.
"Terminal-Bench leaderboard", tbench.ai.
Chroma Research, "Context Rot".
Geoffrey Huntley, "Ralph Wiggum as a software engineer", 2025.
Upstash, "Context7".
Richard Sutton, "The Bitter Lesson", 2019.