project · 2026-05-29 · building

Ralph

#agents #coding #harnesses

Summary

Ralph is my coding agent. Mention it in Slack, a Linear issue, or a GitHub PR comment, and it spins up an isolated cloud sandbox, clones the repo, plans a change, writes and tests it, and opens a pull request. It is a fork of LangChain's open-swe with a handful of changes I cared enough about to build and maintain. The name is a homage to Geoffrey Huntley's Ralph loop, which I started from and have since outgrown.

Why build on open-swe

I did not want to write the boring, load-bearing parts again. open-swe already solves them: a fresh isolated sandbox per thread, triggers from Slack, Linear, and GitHub, repo credentials injected through a proxy so no token sits on disk, and a LangGraph and Deep Agents loop that produces a pull request at the end. The whole agent is assembled in a single get_agent() function, so customizing it means swapping tools and middleware rather than forking a runtime.

The bigger reason is the upgrade path. I keep a ledger that classifies every place I diverge from upstream, so I can pull LangChain's improvements instead of stranding a dead fork. That discipline already paid off. When upstream shipped its own post-hoc reviewer flow, I deleted the custom version I had written, because I trust their maintained path more than my one-off. A fork that cannot take updates is a liability, and most of my changes are designed to sit cleanly on top of someone else's work.

What I changed, and why

The manager delegates the coding to Claude Code. The model running the loop is small and cheap. It scopes the task, hands the implementation to Claude Code inside the sandbox, then checks the result. I am not going to out-engineer Anthropic's coding harness, and I do not need to (Bonus: getting to use my Claude subscription instead of API costs). The manager's job is judgment and verification, not keystrokes. The delegated agent runs with permission prompts skipped, because the sandbox is the security boundary and nothing it touches reaches the host. It carries a quality contract that tells it to verify before reporting done and to match the spec exactly. Lint hooks block it from finishing on any error it newly introduced. And the manager does not take the subagent's word for what happened. It reads the git diff as ground truth and compares that against what the subagent claims it changed. Sessions persist per thread, so a follow-up message continues the same collaboration instead of starting cold.

I own the context management. Deep Agents ships a summarizer, and I turned it off. Running it alongside my own meant two summarizers firing against the same window and fighting over it. Mine caps the model's input below its hard limit so summarization triggers early instead of at the edge, keeps the most recent turns intact, condenses the rest, and offloads bulky tool output to disk where the agent can fetch it back.

An advisor tool for hard calls. When the agent hits a decision it is unsure about, it can ask for a second opinion in a clean context. It reuses the same model with a plain chat call rather than the provider's native advisor feature, because the endpoint I run the manager on does not implement that server side. A small tax for running a non-Anthropic model as the manager.

Web servers get run and tested. A pair of tools boots a dev server, polls it until it answers, auto-installs missing dependencies and retries, frees a held port and retries, then hands a headless browser the running URL to check the frontend. For anything with a UI, a clean compile proves almost nothing so the agent verifies UI with Vercel's agent-browser.

I trimmed the toolset. I cut several inherited tools: todo lists, ls, glob, grep, and write_file. Search runs through the shell with ripgrep and git grep, which the model already knows. Creating new files is real implementation work, so it belongs to Claude Code, not the manager. I kept edit_file for cheap one-line fixups and read_file because the skills system depends on it. The result is a lean manager that delegates the heavy work instead of doing a little of everything.

Status

The core runs: a sandbox per task, delegation with diff and lint verification, the compaction and advisor middleware, dev-server testing, the reviewer, and pull-request output, backed by a few hundred tests.

One decision is worth recording because I reversed it. I built a planner and an evaluator that scored a change against an acceptance contract before submitting. I removed it. Reasoning directly from the task and verifying after the fact beat a heavy plan I had to keep in sync, especially once upstream's reviewer covered the same ground. The piece still missing is server-side scheduling, so Ralph can pick up work on a timer instead of waiting to be called.