writing · 2026-05-31

Your Agent Context Is Trapped in JSONL

#agents #context #knowledge

You spend a Tuesday in Claude Code. By the time you close the laptop, the disk holds a full account of the day: the feature you built, the test that kept failing, the migration you gave up on at four in the afternoon, the line you left yourself about what to try tomorrow. Open the agent the next morning and it knows none of it. Clean window. It re-reads the project structure you already explained, re-proposes the migration that failed, and asks the question you answered yesterday.

Intelligence is not the bottleneck anymore. A current model plans well enough to carry a four-hour task. The bottleneck is context, and the strange part is where it sits: on your own disk, the whole time, in more detail than you would ever write down on purpose.

Picture the version that works. The agent reads what you did Tuesday night, sees the half-built feature and the note about tomorrow, and starts. You wake up to a running MVP of the thing you sketched out, or to a fix for the bug you left open, with a diff and a short account of what changed waiting for review. The context to do that already exists. What's missing is the read path back into it.

What's already on disk

Walk through what one afternoon of agent work leaves behind. The harness keeps an append-only transcript of the whole conversation, one JSON object per line, so it survives a crash and replays in order. Under it sits session state in SQLite: which task is active, which files the agent edited, which tools are registered. If the agent drove a browser to check a page, there's a profile directory and a folder of screenshots. The repo carries the diffs and the new files. Run the tests and you get pass-fail records with stack traces. Memory files hold the notes the agent wrote for itself across sessions. None of this is exotic. It is the normal exhaust of an agent doing its job.
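To make the transcript concrete, here is a minimal sketch of reading one back, assuming only what the paragraph says: one JSON object per line, appended in order. The field names in the sample (`type`, `text`, `command`, `exit_code`) are made up for illustration and do not match any particular harness's schema.

```python
import json
from typing import Iterable, Iterator


def read_transcript(lines: Iterable[str]) -> Iterator[tuple[int, dict]]:
    """Yield (line_number, record) for each parseable line, in file order."""
    for line_no, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            # A crash mid-write can leave a truncated line; skip it rather than abort.
            continue
        if isinstance(record, dict):
            yield line_no, record


# Hypothetical sample; the last line is cut off the way a crash might leave it.
sample = [
    '{"type": "user", "text": "fix the auth bug"}',
    '{"type": "tool_call", "command": "pytest tests/test_auth.py"}',
    '{"type": "tool_result", "exit_code": 1',
]
events = list(read_transcript(sample))
```

Keeping the line number next to each record costs nothing at read time, and it is what later lets a claim point back at the exact line it came from.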

Each store holds something the next run wants. The transcript records what the agent tried and in what order. The test log records which command failed and why. The diff records what changed, set against what the agent claimed it changed. The screenshots show what the page looked like before the regression. The value is real and it is local. The catch: these stores speak different formats, share no identifiers, and offer no common surface to query. You cannot ask "what did we try for the auth bug and what happened," because the answer is smeared across four files in three formats with no key to join them.

The standard we're missing

The value here is high enough that you would expect this solved already. An agent that reads your own work back to you, picks up the half-finished thread, and acts on it overnight is worth more than another point on a benchmark. It does not exist because every harness writes its stores in its own shape and no two agree. Claude Code's transcript schema is not Codex's, and neither one holds still between releases. Your context sits on disk in a dozen private formats, and nothing was built to read across them.

The industry settled this once already, for instructions. AGENTS.md became the agreed file an agent reads to learn how to work in your repo, and most tools honor it. (Anthropic still ships CLAUDE.md, but a format won and the ecosystem lined up behind it.) Memory has no equivalent. No agreed schema for what a session records, no way for one tool to read another tool's trace, no settled place where the durable record of your work lives. Agree on that the way the industry agreed on AGENTS.md, and your context stops being trapped in whichever harness happened to write it.

Until then, the read path falls to you. The harness already writes the stores; the missing work is joining them and tracking where each fact came from. Give the records shared keys and a failed test, the command that produced it, and the diff that fixed it all hang off one session instead of scattering across three files. Keep each claim one hop from its evidence and the layer stays correctable: read that the V2 migration was abandoned, and the transcript line that abandoned it sits right behind the claim. Do that and the next run can ask what you tried and what failed last time instead of rediscovering it cold.
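One way to picture shared keys and one-hop evidence is a small record type. This is a sketch under my own assumptions, not an existing schema: `Fact`, `Evidence`, `session_id`, and every value in the sample data are hypothetical names and placeholders.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source: str   # which store the fact came from, e.g. a transcript file
    locator: str  # where in that store, e.g. a line number


@dataclass(frozen=True)
class Fact:
    session_id: str  # the shared key every store is normalized onto
    kind: str        # "command", "test_failure", "diff", "decision", ...
    claim: str       # the human-readable statement
    evidence: Evidence  # one hop back to the raw record


def group_by_session(facts):
    """Hang every fact off its session so related records stay together."""
    sessions = defaultdict(list)
    for fact in facts:
        sessions[fact.session_id].append(fact)
    return dict(sessions)


# Illustrative placeholder data only.
facts = [
    Fact("s-042", "command", "ran the auth tests", Evidence("transcript.jsonl", "line 212")),
    Fact("s-042", "test_failure", "token refresh test failed", Evidence("test-log.txt", "line 88")),
    Fact("s-042", "diff", "changed the token module", Evidence("session.diff", "hunk 3")),
    Fact("s-043", "decision", "V2 migration abandoned", Evidence("transcript.jsonl", "line 519")),
]
by_session = group_by_session(facts)
```

With this shape, "what did we try for the auth bug" becomes a lookup on `by_session["s-042"]`, and every claim in the answer carries the file and position you would open to check it.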

What it buys you

Build the read path and the agent stops starting cold. Play that forward.

The smallest win is a morning brief. Rather than re-derive your project from nothing, the agent opens with where you left off, what is unfinished, and what you said you would try next, assembled from yesterday's transcript and diffs and the note you wrote at the end of the day. A background agent that reads structured work state can hand you that brief before you sit down, which is the version of this I keep sketching in my own notes.

A larger win runs while you sleep. You write out an idea in a few messages and close the laptop. The agent reads the sketch, sees nothing built yet, and stands up an MVP overnight. You wake up to a running prototype and a diff to read.

The win I want most catches your mistakes. The agent reads back over the day, finds the thing you left half-done or the test you wrote that passes for the wrong reason, and fixes it, with the change staged and a note on why it touched what it touched. You review and approve. The agent earns the right to work overnight by showing its evidence, not by being trusted blind, the way a good review queue shows you what changed and why before you sign off.

The same readable trace changes the economics. Break a run into what each step cost and you can see where the tokens went: the context the model reread for nothing, the tool output it never used, the loop that burned ten thousand tokens to end where it started. A token budget stops being a number you tune blind and becomes something you audit against the trace, line by line, before you pay for the same waste again tomorrow.
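A rough sketch of that audit, assuming the trace has already been reduced to per-step records with a token count. The `kind` labels and the numbers are invented for the example; a real harness would report usage in its own shape.

```python
from collections import Counter

# Hypothetical per-step records derived from a trace.
steps = [
    {"step": 1, "kind": "read_context", "tokens": 4000},
    {"step": 2, "kind": "tool_output", "tokens": 1500},
    {"step": 3, "kind": "read_context", "tokens": 4000},
    {"step": 4, "kind": "model_reply", "tokens": 500},
]


def tokens_by_kind(steps):
    """Sum token cost per kind of step."""
    totals = Counter()
    for step in steps:
        totals[step["kind"]] += step["tokens"]
    return totals


totals = tokens_by_kind(steps)
grand_total = sum(totals.values())
# Fraction of the run's tokens spent on each kind of step.
share = {kind: count / grand_total for kind, count in totals.items()}
```

In this made-up run, rereading context accounts for 8,000 of 10,000 tokens, which is the kind of line item a budget can be audited against instead of tuned blind.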

It's starting to converge

The pieces are starting to line up, from three directions at once.

LangChain has argued for years that the trace is the raw material of agent work, the record you debug and learn from, and it now pipes Claude Code sessions straight into LangSmith: every user message, tool call, compaction step, and subagent run, grouped by thread.

OpenAI shipped Euphony, which is the first time a frontier lab has treated the pile of session JSONL as something worth naming. It reads Harmony conversations and Codex session logs and renders them as browsable timelines, with filtering and an in-browser JSONL editor, instead of leaving you to scroll raw JSON. A viewer is not a memory layer. But when a lab ships a tool that treats your agent sessions as structured data you should be able to read, that is the acknowledgment this space was waiting for.

Open source is moving fastest. Tracebase imports the JSONL transcripts Claude and Codex already write, encrypts them at rest, builds a local searchable index, and serves a localhost dashboard that flags context waste, repeated commands, and loops. That is the read path, built over the stores you already have, on your own machine.

The part that stays hard

Storing the context and agreeing on a shape for it is a coordination problem, and the projects above show it starting to give. The layer you build on top is the part that stays hard, and it carries a name: the no-escape theorem for semantic memory. Search your memory by meaning and it crowds itself as it grows, since the more you store, the more similar records blur together and the right one stops surfacing. Store exact records to dodge that and you lose the semantic reach. No single design hands you faithful recall, speed, flexibility, and control at once, so you pick a point on the frontier.

The local case picks a good point for you. The raw stores are already exact episodic records on disk, so keep them as the source of truth. Use semantic retrieval to navigate back into them, never as the ledger itself, and ground every answer in a transcript line or a diff you can open. The pattern I trust keeps the files as the episodic record and runs the model as a semantic layer over them. For coding work the bias should lean further toward the raw files than people assume: plain text search over the actual transcripts and diffs holds up against embeddings more often than you would expect, because the stores are the thing you want, not a lossy summary of it.
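As a baseline for the plain-text end of that bias, here is a minimal sketch of searching the raw stores and returning each hit with the file and line it came from. The file suffixes, the `search_stores` name, and the sample contents are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def search_stores(root: Path, needle: str, suffixes=(".jsonl", ".diff", ".log")):
    """Case-insensitive substring search; returns (file name, line number, line)."""
    hits = []
    wanted = needle.lower()
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        with path.open(encoding="utf-8", errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if wanted in line.lower():
                    hits.append((path.name, line_no, line.strip()))
    return hits


# Build a throwaway directory with hypothetical store contents.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "session.jsonl").write_text(
        '{"type": "user", "text": "start the V2 migration"}\n'
        '{"type": "assistant", "text": "abandoning the V2 migration, schema conflict"}\n',
        encoding="utf-8",
    )
    (root / "change.diff").write_text("+ def refresh_token():\n", encoding="utf-8")
    hits = search_stores(root, "v2 migration")
```

Every hit is already a citation: a file name and a line number you can open. A semantic layer can rank or summarize these hits, but the answer stays grounded in the lines themselves.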

KnowLedger

This is what I am exploring with KnowLedger. Capture the context-rich artifacts an agent leaves behind, derive structure from the mess after the fact, keep the links back to source evidence, and expose a layer the next run can query. Raw conversations come in as first-class intake. Compilation turns them into something you can navigate. The goal is a memory the agent can interrogate about its own past work and yours.

The harness is what writes all of this. If it never reads, normalizes, and consolidates the stores it produces, the context rots in place no matter how rich it was the moment it landed. That read path is harness work, the same kind of decision that settles whether an agent does useful work at all. The context was never the missing piece. You generate it every day you sit down, in finer detail than you could record by hand. What is missing is the shared format, the keys, the read path that turns that record into something the next agent can use. Some of that is coming from the labs. The rest I am exploring, so the agent can use the day you already had.