The chat window is not your system's state. That is the engineering mistake most multi-agent frameworks make, and it is the one I made myself for the first three weeks of building VNX.
Six months later, after 6,000+ receipts on disk and a dozen production crashes survived without losing a dispatch, I am convinced the right architecture is the one most teams skip: treat the append-only NDJSON receipt ledger as the canonical source of truth, and every other state file as a rebuildable cache.
This post is about why, what it actually buys you, and where the trade-offs show up.
The mistake: chat window as state
The naive multi-agent setup looks like this. Three Claude sessions, three panes, an orchestrator that reads scrollback to decide what to do next. State lives in the conversational history. If a worker crashes, you scroll up. If you forget what T2 was doing, you ask it.
This works for ten minutes. It does not work for ten hours.
Chat scrollback is unstructured. You cannot grep "all dispatches from agent T2 last Tuesday that touched the auth module." You cannot replay a sequence. You cannot prove what an agent did six weeks ago to satisfy an audit. You cannot recover from a crash because the crash erased the only record.
Git history shows what changed. Chat logs show some of what was discussed. Neither shows why an agent decided to change something, and that "why" is exactly what you need to debug, audit, or prove compliance.
The fix: append-only receipt ledger
Every dispatch lifecycle event in VNX writes a structured JSON line into .vnx-data/state/t0_receipts.ndjson. One line per event. Append-only. Crash-resilient by design, a partial write of one line cannot corrupt prior lines.
Each receipt carries:
dispatch_id, unique identifier joining the receipt to a manifestterminal, which worker (T0, T1, T2, T3)event_type, task_started, task_complete, task_failed, gate_decision, lease_acquired, lease_released, etc.status, pass, fail, pending, blockedprovenance.git_refandprovenance.is_dirty, which commit, was the worktree cleanmetadata.cost_estandmetadata.duration_sec, token cost and wall timeinstruction_sha256, cryptographic hash of the literal instruction text
After 6,000+ entries you can grep, jq, awk your AI agent's history natively. Three months ago you can answer "did Codex's gate fail more last week or this week?" with one pipe.
tail -f.vnx-data/state/t0_receipts.ndjson | jq -c '{dispatch_id, event_type, terminal, status}'That command is the entire observability stack. No agent, no SaaS, no APM.

Three streams, three jobs, zero corruption
A single ledger has a contention problem. Gate decisions, dispatch lifecycle events, and agent actions all want to write at once. The wrong move is to put them in one file with mutex locks. The right move is three streams, each scoped:
Stream 1, t0_receipts.ndjson (6,000+ entries today). Agent actions. Every dispatch lifecycle event a worker emits.
Stream 2, governance_audit.ndjson. System decisions. Every gate decision, dispatcher decision, escalation, or state mutation that crosses a governance boundary. Receipts answer "what did the agent do?" The audit log answers "what did the system decide about what the agent did?"
Stream 3, dispatch_register.ndjson. Canonical lifecycle. The single normalized timeline a dashboard can replay: dispatch_created, dispatch_promoted, dispatch_started, gate_passed, gate_failed, dispatch_completed, dispatch_rejected.
A malformed write to any one of three cannot corrupt the others. SSE streams in the dashboard read directly off these files (PR #304). Aggregators rebuild summary state from them on demand.
I wrote about the case for log-shaped state in detail, that is the next deep-dive in this series.
The mental model: every other state file is a cache
This is the part most teams resist. If receipts are canonical, then t0_state.json, pr_queue_state.json, the lease tables, all of those are derived state. You can rm them. They rebuild from receipts.
rm.vnx-data/state/t0_state.json
python3 scripts/build_t0_state.py
cat.vnx-data/state/t0_state.json | jq '.summary'scripts/build_t0_state.py is 974 lines of Python whose entire job is replaying the receipt ledger to rebuild the in-memory state. PR #275 added auto-rebuild triggered on state_mutation_receipt events. PR #287 expanded triggers to task_failed and task_timeout.
A SIGKILL'd dispatcher used to mean "did that dispatch land or not?" In the chat-window-as-state world, that question is unanswerable. In the receipt-ledger-as-state world, it is one replay away.
Crash recovery becomes free. That is the headline.
What you can answer once receipts are structured
Once every dispatch produces a structured receipt, you can ask questions you literally could not ask before:
- Which agent has the highest first-pass rework rate?
- Did codex_gate fail more last week or this week?
- Did Sonnet's average dispatch duration regress after the last model update?
- What is the cost-per-feature for the last three PRs?
- Which dispatch shapes correlate with high gate-failure rates?
These were not designed-in features. They emerged from having structured data. SPC (Statistical Process Control) charts on agent quality metrics, first-pass yield, rework rate, control limits, became possible because the receipts were already there.
A Karpathy-style dispatch parameter tracker (F57 in the VNX feature breakdown) does correlation analysis between dispatch shape and outcome. Not because someone planned it. Because the data was already on disk.
Cryptographic instruction provenance
Receipts get one more thing that turns out to matter: a SHA-256 hash of the literal instruction text given to the worker. PR #309 (feat(observability): instruction_sha256 stamp in manifest + receipt).
Why this matters: without it, you cannot verify post-hoc that the receipt corresponds to the instruction that ran. With it, you can prove "this commit was produced from exactly this prompt" without trusting the worker's self-report.
Useful for compliance review. Useful for forensic analysis of bad merges. The implementation is one line of Python where the dispatch is written. The audit value is permanent.
I unpacked the full evolution from tmux pane to autonomous orchestrator separately, receipts were the single most consequential design choice.
The trade-offs (honest)
Three real costs to this architecture, in priority order.
One: log-shaped state needs a compaction strategy. After six months my intelligence archive hit 391 MB. SSE streaming choked. The fix was 30 lines of Python in a nightly cron (scripts/compact_state.py, PR #299), three modes: rotate intelligence archive, cap receipts at 10K, evict open-items digest entries >30d. Atomic writes via temp-file + rename. PR #313 fixed an idempotency bug Codex caught in round 1 review. Not glamorous. Necessary.
Two: schema drift between writers and readers is a real bug class. Pre-PR-322, writers wrote status: "completed" while readers checked verdict == "pass". False-negative loops. The fix was canonicalizing the gate result schema in scripts/lib/gate_status.py::is_pass(). Lesson: write a schema module before you have multiple writers. Not after.
Three: you give up SQL. No JOINs across streams. No transactional consistency. You compensate with jq + grep + awk and aggregator scripts. For most analytic questions this is fine. For complex multi-table queries you are doing more pipeline work than a Postgres-backed system would need.
Anti-claims
A few things I will not claim, because the LIMITATIONS doc matters.
VNX is pre-1.0. The receipt ledger is battle-tested at single-operator scale across six months and 6,000+ entries. Multi-repository orchestration is untested. Ten-plus terminals at scale is untested. The schema is stable enough to bet on, not stable enough to call frozen.
The "free crash recovery" claim assumes your receipt write-path is resilient. If you write receipts asynchronously and crash before flush, you lose the same dispatch you would lose with any other architecture. The architecture is only as good as the write discipline.
What this changes for AI orchestration
Most multi-agent frameworks treat observability as something you bolt on after the system works. That is the wrong order. If you treat the structured receipt as primary, written before the action completes, joined to the instruction by hash, separated by stream from gate decisions, you get observability, audit, and crash recovery in the same architecture for free.
Six months of daily use says this is not theoretical. The repo is open source. The receipts are on disk. Every claim in this post is reproducible from your terminal in thirty seconds.
The chat window is not your system's state. The ledger is.
Read also: The Unified Supervisor Pack: from manual kill -9 to self-healing, how the receipt ledger feeds crash recovery in practice.
If you are designing a similar system from scratch: I help teams with production-grade AI architecture.
Want to talk about applying this to your own AI orchestration stack? Connect on LinkedIn or open an issue on the VNX repo. Honest critique welcome.
Sources & references
- VNX Architecture Manifesto, the four pillars
- VNX Limitations / Anti-claims
- Internal data: 4,666 receipts in
SEOcrawler_v2/.vnx-data/state/t0_receipts.ndjson(parent SEO-tool repo, Aug 2025, present) + 1,526 invnx-roadmap-autopilot-wt/.vnx-data/state/t0_receipts.ndjson(standalone VNX repo, Feb 2026, present), at HEADf3194b3(Apr 30, 2026) - PRs referenced: #275 (state mutation triggers), #287 (expanded triggers), #299 (compact_state), #304 (SSE register stream), #309 (instruction_sha256), #313 (compact idempotency fix), #322 (gate schema canonicalization)
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.