Multi-Agent Supervisor: self-healing orchestration

On April 28, 2026 my dispatcher exited silently around 14:30. By the time I noticed at 16:00, six stale terminal leases were blocking every new dispatch. I ended up running this six times before lunch the next day:

```sql
UPDATE terminal_leases SET state='idle' WHERE terminal='T1';
```

That is not a system. That is a person manually pretending to be a system. It is the same gap I described in the VNX evolution from tmux pane to autonomous orchestrator: manual cleanup is a signal that an architectural layer is missing.

The fix that closed the failure mode is one of those features that earn no credit in marketing screenshots but quietly make the difference between "production-grade" and "still a side project." It is documented in claudedocs/2026-04-29-unified-supervisor-research.md, 282 lines of research into the failure, the diagnosis grid, and the three patterns I evaluated before picking one.

This post is the engineering story of the unified supervisor pack. What broke. What I considered. What shipped. And the trade-offs I made.

The April 28 outage, briefly

Around 14:30 the dispatcher (scripts/dispatcher_v8_minimal.sh) exited silently. No crash log. No error trace. Probably a SIGTERM from a system event; laptop sleep plus a USB-C disconnect is the prime suspect, though I could not prove it.

The cascade:

Dispatcher dies. Terminal lease for T1 stays at state='leased' because there is no graceful-exit cleanup.
New dispatches arrive in dispatches/pending/. The next dispatcher restart wants T1, but T1 is "leased" to the dead worker.
Operator (me) does not notice for ~90 minutes. By then five more dispatches have queued, and one of them has been picked up by a worker that also exited mid-run, leaving its own stale lease.
I manually UPDATE the lease table. Restart. UPDATE again. Restart. Six cycles before I gave up and decided "this needs to never happen again."
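The blocking step in that cascade can be reproduced in a few lines. This is a minimal sketch, not the VNX code: it assumes a dispatcher claims a terminal with a conditional UPDATE, uses an in-memory SQLite table, and adds a hypothetical `owner` column. Only the `terminal_leases` table and its `terminal` and `state` columns come from the statement shown earlier.

```python
import sqlite3


def claim_terminal(conn, terminal, worker):
    """Try to lease a terminal; succeeds only if the row is currently idle."""
    cur = conn.execute(
        "UPDATE terminal_leases SET state='leased', owner=? "
        "WHERE terminal=? AND state='idle'",
        (worker, terminal),
    )
    conn.commit()
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE terminal_leases ("
    "terminal TEXT PRIMARY KEY, state TEXT NOT NULL, owner TEXT)"  # owner is hypothetical
)
conn.execute("INSERT INTO terminal_leases VALUES ('T1', 'idle', NULL)")

first = claim_terminal(conn, "T1", "worker-a")   # worker-a gets the lease
# worker-a dies here without releasing anything: the row stays 'leased'
second = claim_terminal(conn, "T1", "worker-b")  # blocked by the stale lease

# The manual fix from the outage, applied by hand:
conn.execute("UPDATE terminal_leases SET state='idle', owner=NULL WHERE terminal='T1'")
conn.commit()
third = claim_terminal(conn, "T1", "worker-b")   # dispatch works again

print(first, second, third)
```

Nothing in this flow ever resets the row on its own, which is why every restart needed another manual UPDATE.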

The diagnosis grid in the research doc lists 14 components. Eleven were already present and could have prevented the cascade. They just were not wired together. The fix is not new infrastructure; it is wiring.

The three patterns I considered

I am putting this section first because the patterns I did not choose matter too.

Pattern A, launchd-first

macOS-native. Wrap every long-running process in a launchd plist with KeepAlive and StartInterval. launchd handles restart, logging, and signal escalation.

Pros: OS-level reliability. Survives operator reboots. Standard macOS pattern.

Rejected because: It hides too much. When launchd restarts the process I do not get visibility into why it restarted. Lease cleanup still has to live somewhere. And it binds VNX to macOS; a Linux deployment would need systemd equivalents.

Pattern B, wrapper-script-first (chosen)

A bash supervisor wrapping the dispatcher with explicit logic: exponential backoff, stable-runtime reset, SIGTERM→SIGKILL escalation, stale-lock cleanup. Visible. Debuggable. Portable.

Pros: Explicit logic in 184 LOC. Operator can read the supervisor and understand exactly what it does. Works on macOS and Linux. Works inside a tmux pane during dev.

Trade-offs: No survival across operator reboots without a launchd shim on top. Acceptable, with the option to add one later (Phase 2).

Pattern C, inline self-supervision

Dispatcher monitors its own health, restarts itself on detected failures.

Rejected because: A process cannot reliably restart itself after death. If the dispatcher is dead, no code is running to detect it. This is the equivalent of asking the corpse to call 911.

The choice: Pattern B. Wrapper-script-first, with explicit launchd as a follow-up if needed.

Figure: three patterns evaluated for the unified supervisor. Pattern A (launchd-first, rejected), Pattern B (wrapper-script, chosen), Pattern C (inline self-supervision, rejected), with trade-offs visible per layer.

What actually shipped (six PRs, ~905 LOC)

The unified supervisor pack landed across six PRs. None over 300 LOC. Each independently revertable.

Layer 1, wrapper-script supervisors

scripts/dispatcher_supervisor.sh (184 LOC, PR #242 expanded in #318) and scripts/receipt_processor_supervisor.sh (PR #319, similar shape).

Behavior:

Exponential backoff: init 2s, max 60s, env-tunable via VNX_SUPERVISOR_BACKOFF_INIT/MAX
Stable-runtime reset: after 60s of stable child uptime, reset backoff to init
SIGTERM→SIGKILL escalation: child gets SIGTERM, then SIGKILL after a 10s grace period
Stale-lock cleanup: _clear_stale_dispatcher_lock() runs before each restart and removes lock dirs whose PID is dead
Mode flag: VNX_SUPERVISOR_MODE=legacy|unified for safe rollout
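The SIGTERM→SIGKILL escalation can be sketched in Python for illustration. The shipped supervisor is a bash script, so this is a translation of the idea rather than the actual code; `stop_child` and the stubborn demo child are hypothetical, and only the 10-second grace period comes from the list above. The sketch assumes a POSIX system.

```python
import subprocess
import sys


def stop_child(proc, grace_seconds=10.0):
    """Ask the child to stop with SIGTERM; escalate to SIGKILL after the grace period."""
    if proc.poll() is not None:
        return proc.returncode          # child already exited on its own
    proc.terminate()                    # SIGTERM on POSIX
    try:
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()                     # SIGKILL cannot be caught or ignored
        return proc.wait()


# Demo child that ignores SIGTERM, so the escalation path is exercised.
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)

child = subprocess.Popen(
    [sys.executable, "-c", STUBBORN_CHILD], stdout=subprocess.PIPE, text=True
)
child.stdout.readline()                 # block until the child has installed its handler
exit_code = stop_child(child, grace_seconds=0.5)
child.stdout.close()

# On POSIX a negative return code -N means the child was terminated by signal N.
print(exit_code)
```

The short grace period in the demo only keeps the example fast; a real dispatcher would get the full grace period to finish in-flight work.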

Why this matters: naive restart loops thrash on persistent failures. A syntax error after a bad merge causes a tight CPU-burning restart cycle until you notice. With backoff + stable-runtime reset, transient crashes recover fast (2s) and pathological ones slow down (60s) without operator triage.
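The interaction between backoff and stable-runtime reset is easier to see as a schedule. The following is a minimal sketch under stated assumptions: the delay doubles on each restart (the post only says "exponential", so the factor of 2 is assumed), and `next_delay` and `simulate` are hypothetical names. The 2s, 60s, and 60s-uptime values are the defaults listed above.

```python
def next_delay(backoff, child_uptime, init=2.0, maximum=60.0, stable_after=60.0):
    """Return (delay before this restart, backoff to use for the next one)."""
    if child_uptime >= stable_after:
        backoff = init                      # child was healthy for a while: start over
    return backoff, min(backoff * 2, maximum)


def simulate(uptimes, init=2.0, maximum=60.0, stable_after=60.0):
    """Delays the supervisor would sleep for a sequence of child uptimes (seconds)."""
    backoff = init
    delays = []
    for uptime in uptimes:
        delay, backoff = next_delay(backoff, uptime, init, maximum, stable_after)
        delays.append(delay)
    return delays


# A child that dies instantly every time, e.g. a syntax error after a bad merge:
crash_loop = simulate([0.1] * 7)
print(crash_loop)        # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]

# Three quick crashes, then two minutes of healthy uptime, then one more crash:
recovered = simulate([0.1, 0.1, 0.1, 120.0, 0.1])
print(recovered)         # [2.0, 4.0, 8.0, 2.0, 4.0]
```

The first schedule shows the pathological case settling at one restart per minute; the second shows the reset bringing a recovered child back to the fast path.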

Layer 2, throttled lease sweep

scripts/lib/lease_sweep.py runs every 30 seconds inside the dispatcher prelude. Calls LeaseManager.expire_stale() (scripts/lib/lease_manager.py, 419 LOC).

Behavior:

Reaps terminal_leases.state='leased' rows whose worker died without graceful exit
30-second TTL (faster reap = more risk of false positives; slower = stale-lease debt accumulates)
Idempotent: running twice within 30s is safe

Why 30 seconds: dispatch ack timeouts are typically 60-90 seconds, so a 30s sweep gives the system one chance to recover gracefully before lease cleanup kicks in. Not aggressive. Not lazy. Tuned to the system's natural cadence.
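A throttled, idempotent sweep of this kind might look like the sketch below. It is not the code in scripts/lib/lease_sweep.py or LeaseManager.expire_stale(), whose internals are not shown here. It assumes staleness is judged from a hypothetical `last_heartbeat` column, adds a hypothetical `owner` column, and uses the made-up names `expire_stale_leases` and `ThrottledSweep`.

```python
import sqlite3
import time

LEASE_TTL_SECONDS = 30.0
SWEEP_INTERVAL_SECONDS = 30.0


def expire_stale_leases(conn, now, ttl=LEASE_TTL_SECONDS):
    """Reset leased rows whose heartbeat is older than the TTL. Returns rows reaped."""
    cur = conn.execute(
        "UPDATE terminal_leases SET state='idle', owner=NULL "
        "WHERE state='leased' AND last_heartbeat < ?",
        (now - ttl,),
    )
    conn.commit()
    return cur.rowcount


class ThrottledSweep:
    """Answers 'is a sweep due?' at most once per interval."""

    def __init__(self, interval=SWEEP_INTERVAL_SECONDS, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_run = None

    def due(self):
        now = self.clock()
        if self.last_run is not None and now - self.last_run < self.interval:
            return False
        self.last_run = now
        return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE terminal_leases ("
    "terminal TEXT PRIMARY KEY, state TEXT, owner TEXT, last_heartbeat REAL)"
)
conn.executemany(
    "INSERT INTO terminal_leases VALUES (?, ?, ?, ?)",
    [
        ("T1", "leased", "dead-worker", 1000.0),  # heartbeat 45s old at now=1045
        ("T2", "leased", "live-worker", 1040.0),  # heartbeat 5s old
        ("T3", "idle", None, None),
    ],
)

first_pass = expire_stale_leases(conn, now=1045.0)
second_pass = expire_stale_leases(conn, now=1045.0)   # idempotent: nothing left to reap
states = dict(conn.execute("SELECT terminal, state FROM terminal_leases"))
print(first_pass, second_pass, states)
```

Because the UPDATE only matches rows that are still leased and stale, a second pass in the same window is a no-op, which is what makes it safe to call from the dispatcher prelude on every loop behind the throttle.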

Layer 3, single-owner worker-exit cleanup

scripts/lib/cleanup_worker_exit.py (PR #315). The single place that:

Releases the lease
Transitions the worker FSM to exited_clean or exited_bad
Moves the dispatch file to completed/ or rejected/
Writes the audit event

Idempotent. Never raises. Single owner: every other code path that handles worker exit imports this helper. No more "this code path forgot to release the lease" bugs.

Pre-PR-315, there were three different code paths releasing leases, and at least one had a corner case where it forgot. PR #315 collapses them into one helper.
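The "idempotent, never raises" contract is the interesting part of such a helper. Below is a minimal sketch of that contract, not the contents of scripts/lib/cleanup_worker_exit.py: `release_on_exit`, the `owner` column, and the `audit_events` table are hypothetical, and the dispatch-file move to completed/ or rejected/ is left out to keep the example short.

```python
import logging
import sqlite3

log = logging.getLogger("cleanup_sketch")


def release_on_exit(conn, terminal, worker, exit_code):
    """Release the lease and record the outcome once. Returns True if this call did the work."""
    try:
        with conn:  # one transaction: commits on success, rolls back on error
            cur = conn.execute(
                "UPDATE terminal_leases SET state='idle', owner=NULL "
                "WHERE terminal=? AND owner=? AND state='leased'",
                (terminal, worker),
            )
            if cur.rowcount == 0:
                return False  # already cleaned up, or the lease belongs to someone else
            final_state = "exited_clean" if exit_code == 0 else "exited_bad"
            conn.execute(
                "INSERT INTO audit_events (terminal, worker, final_state) VALUES (?, ?, ?)",
                (terminal, worker, final_state),
            )
        return True
    except Exception:
        # Callers run this on their way out; a cleanup error must not mask the real failure.
        log.exception("worker-exit cleanup failed for %s", terminal)
        return False


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE terminal_leases (terminal TEXT PRIMARY KEY, state TEXT, owner TEXT);
    CREATE TABLE audit_events (terminal TEXT, worker TEXT, final_state TEXT);
    INSERT INTO terminal_leases VALUES ('T1', 'leased', 'worker-a');
    """
)

first = release_on_exit(conn, "T1", "worker-a", exit_code=-15)
second = release_on_exit(conn, "T1", "worker-a", exit_code=-15)  # repeat call is a no-op
events = conn.execute("SELECT final_state FROM audit_events").fetchall()
print(first, second, events)
```

Guarding the UPDATE on owner and state is what lets the graceful-exit path, the supervisor, and the sweep all call the same function without double-releasing or writing duplicate audit events.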

Why three layers, not one

A frequent question: why not just put all the cleanup in the supervisor and skip the lease sweep and the helper?

Three reasons.

One: The supervisor only sees process exit. It does not see "worker is alive but stuck." The lease sweep covers stuck workers that did not exit cleanly. Different failure mode, different layer.

Two: The single-owner helper is called from many places: graceful exit, supervisor cleanup, lease sweep, manual operator action. Centralizing the logic means one place to fix bugs. Folding it into the supervisor would leave the other callers with their own copies, which is the pre-PR-315 situation of three separate release paths.

Three: Defense in depth. If the supervisor has a bug, the lease sweep still catches stale leases within 30 seconds. If the lease sweep has a bug, the helper still does the right thing when called manually. The system tolerates one layer being broken.

This is the pattern I learned in industrial automation (ISA/IEC 62443): zones and conduits, defense in depth, no single point of failure. I wrote about it in ISA/IEC 62443 applied to AI governance. The principles apply directly to multi-agent orchestration.


Anti-claims

Honest section. Three things this is not.

Not a substitute for proper crash recovery. The supervisor restarts processes. The receipt ledger still has to be the source of truth for state, and scripts/build_t0_state.py still has to rebuild derived state from receipts on every restart. This is documented in the Glass-Box Governance post: the supervisor is one layer; the ledger is the other.

Not a "the system never crashes" claim. The system still crashes. It just recovers without manual intervention in most cases. The unified supervisor research doc lists three categories of failure that still require human triage, primarily corrupted state files (rare) and config errors after deploy (also rare).

Not multi-machine HA. This is single-machine resilience. Distributed multi-machine HA for VNX is on the roadmap but not shipped. The supervisor pack assumes one operator, one box, one tmux session.
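The first anti-claim rests on replay: after a restart, derived state comes from the receipt ledger, not from whatever the dead process had in memory. A minimal sketch of that idea follows, assuming an append-only file of JSON lines with hypothetical `dispatch_id` and `event` fields. The real receipt schema and scripts/build_t0_state.py are not shown in this post, so every name here is illustrative.

```python
import json


def rebuild_state(receipt_lines):
    """Fold receipts, oldest first, into the latest known event per dispatch."""
    state = {}
    for line in receipt_lines:
        line = line.strip()
        if not line:
            continue
        try:
            receipt = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. a torn final line from a crash mid-write
        if not isinstance(receipt, dict):
            continue
        dispatch_id = receipt.get("dispatch_id")
        event = receipt.get("event")
        if dispatch_id is None or event is None:
            continue
        state[dispatch_id] = event  # later receipts overwrite earlier ones
    return state


ledger = [
    '{"dispatch_id": "d1", "event": "started"}',
    '{"dispatch_id": "d2", "event": "started"}',
    '{"dispatch_id": "d1", "event": "completed"}',
    '{"dispatch_id": "d2", "event": "comp',        # truncated by a crash
]

rebuilt = rebuild_state(ledger)
print(rebuilt)   # {'d1': 'completed', 'd2': 'started'}
```

The supervisor can restart the dispatcher as often as it likes precisely because a replay like this starts from the ledger each time.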

What it changes day-to-day

Six months into running this:

Stale leases per week: was ~3-5, now ~0 (typical week is zero events)
Manual operator interventions: was ~6 per outage, now 0-1 (auto-recovery handles most)
Time from dispatcher crash to back-online: was 5-30 minutes (depends on when I noticed), now ≤4 seconds
Operator memory load: was constant ("is the dispatcher alive? is T1 leased?"), now low ("the supervisor handles it")

That last one is the headline. Self-healing is not a metric. It is what is missing from your day. You stop thinking about it.

The 6+ manual kill -9 + UPDATE cycles I did on April 28 were the last set of those. Since the unified supervisor shipped, zero. The pack pays for itself in operator-time saved within the first week.

What is next

Three roadmap items for the supervisor pack:

launchd integration (Phase 2): wrap the supervisor itself in launchd for survival across operator reboots. Optional. Gives 95% of value with 5% of complexity, per the research doc.
Cross-machine HA (Phase 3): distributed lease management for multi-machine deployments. Big project. Currently single-machine.
Better failure-mode telemetry: when the supervisor restarts the child, emit a structured receipt with the detected exit signal, last successful heartbeat, and runtime duration. This enables time-series analysis of "what kills my dispatcher".
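The telemetry item is not shipped, so there is no schema to quote. As a rough sketch of what such a restart receipt could contain, assuming the supervisor has a subprocess-style return code and wall-clock timestamps available (`restart_receipt`, `describe_exit`, and all field names are hypothetical):

```python
import json
import signal


def describe_exit(returncode):
    """Name the signal behind a subprocess-style return code, or None for a normal exit."""
    if returncode < 0:                       # -N means terminated by signal N (POSIX)
        try:
            return signal.Signals(-returncode).name
        except ValueError:
            return f"signal {-returncode}"
    return None


def restart_receipt(returncode, started_at, exited_at, last_heartbeat_at):
    """Build one structured record per supervisor-initiated restart."""
    return {
        "event": "supervisor_restart",
        "exit_code": returncode,
        "exit_signal": describe_exit(returncode),
        "runtime_seconds": round(exited_at - started_at, 3),
        "last_heartbeat_at": last_heartbeat_at,
    }


receipt = restart_receipt(
    returncode=-15, started_at=100.0, exited_at=190.5, last_heartbeat_at=185.0
)
line = json.dumps(receipt, sort_keys=True)   # one JSON object per line, append-only
print(line)
```

With one line like this per restart, "what kills my dispatcher" becomes a group-by on `exit_signal` instead of a guess about laptop sleep.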

Each is its own PR sequence. The current state is "good enough for production": better than fine, but with known room for improvement.



Want to apply this pattern to your own multi-agent setup? The supervisor pack is open source on the VNX repo, scripts/dispatcher_supervisor.sh, scripts/receipt_processor_supervisor.sh, scripts/lib/cleanup_worker_exit.py. Issues and PRs welcome.

Sources & references

VNX Orchestration repo
claudedocs/2026-04-29-unified-supervisor-research.md, 282 lines of postmortem and pattern analysis
docs/operations/UNIFIED_SUPERVISOR.md, operator guide with cutover plan
PRs: #242 (initial supervisor), #315 (cleanup_worker_exit helper, SUP-PR1), #318 (operator guide, SUP-PR5), #319 (receipt processor supervisor, SUP-PR4)
Related: Glass-Box Governance, receipts as database, ISA/IEC 62443 to AI governance

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.

