When Andrej Karpathy talks about training neural networks, one of his recurring themes is the tooling: track every hyperparameter against every outcome metric, look for correlations, learn what shape of input correlates with what shape of output. That discipline turned ML practice from "try things and hope" into something closer to a science.
I think multi-agent orchestration deserves the same treatment. Until we apply it, we are running our AI dispatches the way ML practitioners ran experiments in 2014: on vibes.
This post is about F57 in VNX, the dispatch parameter tracker. What it records, how it correlates, and what surprised me when the data started landing.
The premise
Every dispatch in VNX has a shape, a set of parameters that, in principle, should predict the outcome. Some of them are obvious:
- Instruction length: how many tokens the prompt contains
- File scope: how many files the worker has permission to touch
- Tool count: how many tools are exposed to the worker (Bash, Edit, Write, Grep, Read, etc.)
- Provider: Claude Sonnet, Claude Opus, Codex, Gemini
- Model version: Claude 4.6, Claude 4.7, Codex 5.2, Codex 5.4
And some are less obvious but worth tracking:
- Time of day the dispatch starts
- Worktree state: clean, dirty-low, or dirty-high
- Dispatch source: staging→promote vs popup vs autonomous
- Whether prior context was injected (intelligence pattern + how much)
The premise: with enough dispatches, correlations emerge. Some are predictive, some are surprising, and some are spurious. The dispatch parameter tracker (F57, hooked into the receipt write path) records all of this so the analysis can happen afterwards.
What the tracker records (concrete)
Per dispatch, an enriched receipt with these fields:
{
  "dispatch_id": "20260513-103022-feature-foo-A",
  "terminal": "T1",
  "lane": "claude-subprocess",
  "provider": "anthropic",
  "model": "claude-sonnet-4-6",
  "instruction_length_chars": 1842,
  "instruction_length_tokens": 487,
  "file_scope_count": 3,
  "tool_count_exposed": 8,
  "worktree_state": "clean",
  "dispatch_source": "promote",
  "intelligence_injected": true,
  "intelligence_pattern_count": 2,
  "ts_start": "2026-05-13T10:30:22Z",
  "duration_sec": 247,
  "exit_status": "success",
  "quality_score": 0.84,
  "cost_est_usd": 0.18,
  "rework_required": false
}
The quality_score comes from a separate quality-advisory pass, a deterministic lint-like check on the diff (test coverage delta, file-size delta, conformance to project style). Not a perfect proxy for human-judged quality, but a decent signal that's automatable.
provider and lane come from the provider-aware receipt. Each receipt now carries provider, model, and token or cost data, so the tracker can correlate behavior across lanes, not just models.
rework_required is set retroactively if the same dispatch_id triggers a follow-up "fix the issues" dispatch within 24 hours. That is one of the more useful correlation targets.
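The retroactive flag can be sketched roughly as below. This is a minimal illustration, not the F57 implementation: the `follow_up_of` field linking a fix dispatch to its parent is hypothetical (the receipt schema above does not show how the link is stored), and measuring the 24-hour window from the parent's `ts_start` is also an assumption.

```python
from datetime import datetime, timedelta

REWORK_WINDOW = timedelta(hours=24)


def parse_ts(ts: str) -> datetime:
    # Receipts use a trailing "Z"; normalise it so fromisoformat accepts it
    # on older Python versions as well.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def mark_rework(receipts: list[dict]) -> list[dict]:
    """Return copies of the receipts with rework_required set retroactively."""
    by_id = {r["dispatch_id"]: dict(r) for r in receipts}
    for r in receipts:
        # "follow_up_of" is a hypothetical link field, not part of the schema above.
        parent = by_id.get(r.get("follow_up_of"))
        if parent is None:
            continue
        gap = parse_ts(r["ts_start"]) - parse_ts(parent["ts_start"])
        if timedelta(0) <= gap <= REWORK_WINDOW:
            parent["rework_required"] = True
    return list(by_id.values())
```

Working on copies keeps the on-disk receipts untouched, so the flag can be recomputed if the window definition changes.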
Correlations that emerged
After a few thousand dispatches across the VNX system, the data started showing patterns. Some matched my intuition. Some did not.
Correlation 1, Instruction length vs quality (non-monotonic)
I expected: longer instruction = clearer task = higher quality.
The data showed: a non-monotonic relationship. Below ~150 tokens of instruction, quality drops (under-specification). Between 150 and 800 tokens, quality is roughly flat. Above 800 tokens, quality drops again (over-specification, or the worker losing track of what matters).
The takeaway is not "write 400-token prompts." It is "don't pad." The 800-token threshold is consistent across providers and models in my data: Sonnet, Opus, and Codex 5.4 all show the same shape. There is a quality cliff around heavy prompts.
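One way to see this shape is to bucket receipts by instruction length and compare mean quality per bucket. The sketch below is an assumption about how such an analysis could look, not the tracker's own code; the bucket edges are simply the 150 and 800 token thresholds quoted above, and the function names are made up for illustration.

```python
from statistics import mean


def length_bucket(tokens: int) -> str:
    # Bucket edges follow the thresholds described in the text (150 and 800 tokens).
    if tokens < 150:
        return "under_150"
    if tokens <= 800:
        return "150_to_800"
    return "over_800"


def quality_by_length(receipts: list[dict]) -> dict[str, float]:
    """Mean quality_score per instruction-length bucket."""
    buckets: dict[str, list[float]] = {}
    for r in receipts:
        score = r.get("quality_score")
        if score is None:  # no advisory was produced for this dispatch
            continue
        key = length_bucket(r["instruction_length_tokens"])
        buckets.setdefault(key, []).append(score)
    return {name: mean(scores) for name, scores in buckets.items()}
```

A non-monotonic relationship shows up as the middle bucket scoring higher than both outer buckets, which a single linear correlation coefficient would hide.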
Correlation 2, Tool count vs duration (positive, not negative)
Expected: more tools = more capability = faster completion.
Data: more tools = longer completion. Even when the additional tools were not used. The hypothesis is that more tools in the prompt = more cognitive load on the worker = slower decision-making.
I now expose the minimum useful tool set per skill, not the full tool palette. Median duration dropped 20-30% on dispatches where I trimmed tools. This was the most surprising finding from the tracker.
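The comparison behind that number is a grouped median. A minimal sketch, assuming receipts are already loaded as dictionaries with the `tool_count_exposed` and `duration_sec` fields shown earlier (the function name is illustrative):

```python
from collections import defaultdict
from statistics import median


def median_duration_by_tool_count(receipts: list[dict]) -> dict[int, float]:
    """Median duration in seconds for each exposed-tool count."""
    groups: dict[int, list[float]] = defaultdict(list)
    for r in receipts:
        groups[r["tool_count_exposed"]].append(r["duration_sec"])
    # Sort by tool count so the trend is readable when printed.
    return {count: median(durations) for count, durations in sorted(groups.items())}
```

The median is used rather than the mean because a handful of very long dispatches would otherwise dominate each group.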
Correlation 3, Worktree dirtiness vs rework (very strong)
Expected: dirty worktrees are slightly less reliable.
Data: dirty worktrees have 3-5x higher rework rate than clean. The hypothesis: when a worker starts in a dirty worktree, it can confuse "what existed before this dispatch" with "what I am supposed to deliver", leading to commits that mix unrelated work or override pending changes.
This was the data that pushed me to make worktree state a hard gate. Pre-dispatch check: if the worktree is dirty and classified as dirty-high, the dispatcher refuses with a clear "clean the tree first" error. That saves rework in the >90% of cases where the dirty state was accidental.
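A gate like this can be sketched in a few lines. Everything below is an assumption for illustration: the post does not say how dirty-low and dirty-high are distinguished, so the threshold of 10 changed paths is hypothetical, as are the function and exception names. The only real external call is `git status --porcelain`, which prints one line per modified or untracked path.

```python
import subprocess

DIRTY_HIGH_THRESHOLD = 10  # hypothetical cut-off between dirty-low and dirty-high


class DirtyWorktreeError(RuntimeError):
    """Raised when a dispatch is refused because the worktree is too dirty."""


def count_changed_paths(repo_dir: str) -> int:
    # One porcelain line per modified, staged, or untracked path.
    out = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir, capture_output=True, text=True, check=True,
    ).stdout
    return len([line for line in out.splitlines() if line.strip()])


def classify_worktree(changed_paths: int) -> str:
    if changed_paths == 0:
        return "clean"
    if changed_paths < DIRTY_HIGH_THRESHOLD:
        return "dirty-low"
    return "dirty-high"


def pre_dispatch_gate(worktree_state: str) -> None:
    # Only dirty-high is a hard refusal; dirty-low is allowed through.
    if worktree_state == "dirty-high":
        raise DirtyWorktreeError("clean the tree first")
```

In use, the dispatcher would call `pre_dispatch_gate(classify_worktree(count_changed_paths(repo_dir)))` before spawning a worker and record the resulting state in the receipt either way.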
Correlation 4, Time of day (small but real)
Expected: no effect.
Data: dispatches started between 22:00 and 06:00 local time have ~10% lower quality scores. Probably because the worker is more often headless during those hours (no operator watching, less of a correction loop). When a dispatch runs while I am awake, quality is slightly higher because I catch and correct issues mid-flight.
This is a "you, the human, are the variable" effect more than a worker-state effect. But it shows up consistently.
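Classifying a dispatch into the night window has two small traps: `ts_start` is stored in UTC, and the 22:00-06:00 window wraps past midnight. A minimal sketch, assuming the caller supplies the operator's local timezone (the function name and the `local_tz` parameter are illustrative):

```python
from datetime import datetime, time, tzinfo

NIGHT_START = time(22, 0)
NIGHT_END = time(6, 0)


def is_night_dispatch(ts_start: str, local_tz: tzinfo) -> bool:
    """True if the dispatch started between 22:00 and 06:00 local time."""
    utc = datetime.fromisoformat(ts_start.replace("Z", "+00:00"))
    local_time = utc.astimezone(local_tz).time()
    # The window wraps past midnight, so it is the union of two ranges.
    return local_time >= NIGHT_START or local_time < NIGHT_END
```

Passing a `zoneinfo.ZoneInfo` instance as `local_tz` would handle daylight-saving shifts; a fixed `timezone(timedelta(hours=2))` is enough for a quick check.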
Correlation 5, Lane vs cost (structural)
Expected: model choice drives cost.
Data: lane choice drives cost more. The ephemeral tmux-spawn lane runs on subscription credits, while the claude-subprocess lane bills to API credits after the June 15 Anthropic billing change. A tmux-spawn dispatch costs zero API dollars but has interactive startup overhead. The claude-subprocess lane is faster but no longer free. This changes dispatch economics in ways model choice does not.
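The trade-off becomes visible once cost and duration are aggregated per lane instead of per model. A small sketch under the assumption that every receipt carries `lane`, `cost_est_usd`, and `duration_sec` as in the example receipt (the function name is illustrative):

```python
from collections import defaultdict
from statistics import mean


def cost_and_duration_by_lane(receipts: list[dict]) -> dict[str, dict]:
    """Dispatch count, mean estimated cost, and mean duration per lane."""
    groups: dict[str, list[dict]] = defaultdict(list)
    for r in receipts:
        groups[r["lane"]].append(r)
    return {
        lane: {
            "dispatches": len(rs),
            "mean_cost_usd": mean(r["cost_est_usd"] for r in rs),
            "mean_duration_sec": mean(r["duration_sec"] for r in rs),
        }
        for lane, rs in groups.items()
    }
```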

What the tracker does NOT show
Three honest disclaimers, because correlations need careful framing.
Not causation. This is correlation. "More tools = longer duration" might not mean that tools cause the slowdown; it might be that complex tasks both expose more tools AND take longer. Without controlled experiments, you cannot distinguish the two.
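Short of a controlled experiment, a cheap partial check is to compare durations within groups of similar task complexity. The sketch below uses `file_scope_count` as a rough complexity proxy, which is an assumption of mine and not something the tracker claims; the split at 5 tools is likewise arbitrary.

```python
from collections import defaultdict
from statistics import median


def tool_effect_within_strata(receipts: list[dict], tool_split: int = 5) -> dict[int, float]:
    """Median duration gap (many tools minus few tools) per file-scope stratum."""
    strata = defaultdict(lambda: {"few": [], "many": []})
    for r in receipts:
        arm = "few" if r["tool_count_exposed"] <= tool_split else "many"
        strata[r["file_scope_count"]][arm].append(r["duration_sec"])
    result = {}
    for scope, arms in sorted(strata.items()):
        # Only strata that contain both arms can be compared.
        if arms["few"] and arms["many"]:
            result[scope] = median(arms["many"]) - median(arms["few"])
    return result
```

If the gap stays positive inside each stratum, task complexity alone is a less likely explanation, although unmeasured confounders can still be at work.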
Not stationary. Model behavior changes between versions. The Sonnet 4.6 → 4.7 transition shifted some of these correlations. The tracker keeps recording; the analysis windows have to be model-version-aware.
Not generalizable to your codebase. My VNX is a Python+bash system with NDJSON state. Your codebase has different shapes, different complexity, different review cycles. The pattern (track parameters, look for correlations) generalizes; the specific findings might not.
What surprised me
Four findings I genuinely did not expect.
One: intelligence injection (passing prior patterns into the prompt) helps less than I thought. The quality-score uplift from injected patterns is ~3-5%, not the 15-20% I had assumed when I built the system. Useful, but not the dominant factor I had positioned it as.
Two: the biggest predictor of success is whether the worker had access to the right files, not the prompt phrasing, not the model, not the tool set. File scope coverage is the most important hyperparameter. This means dispatch manifest design matters more than prompt design.
Three: model version impact is small. Going from Claude 4.6 to 4.7 shifted average quality scores by ~5%. Going from "bad file scope" to "good file scope" shifted them by ~25%. The model upgrade was free; the architectural improvement was 5x bigger.
Four: the tracker caught the tmux-spawn idle-worker bug before I noticed it manually. Instruction length, file scope, and tool count were all normal. But duration was an outlier near zero seconds and no quality advisory was produced because no files changed. The receipt was a lane-synthesized fallback. F57 flagged it as a parameter-outcome mismatch. That turned a silent failure into a detectable pattern.
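The mismatch check described here can be approximated as a simple rule over one receipt. This is a sketch of the idea, not F57's actual rule: the 5-second cut-off for "near zero" and the way "normal" inputs are tested are both assumptions, and a missing quality advisory is assumed to appear as an absent or null `quality_score`.

```python
def is_parameter_outcome_mismatch(receipt: dict, min_duration_sec: float = 5.0) -> bool:
    """Flag receipts whose inputs look normal but whose outcome looks empty."""
    normal_inputs = (
        receipt["instruction_length_tokens"] > 0
        and receipt["file_scope_count"] > 0
        and receipt["tool_count_exposed"] > 0
    )
    near_zero_duration = receipt["duration_sec"] < min_duration_sec
    # No files changed means no quality advisory was produced.
    no_quality_advisory = receipt.get("quality_score") is None
    return normal_inputs and near_zero_duration and no_quality_advisory
```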
The headline I would write: the variance in your AI workflow comes more from your dispatch design than from your model choice. That is the hidden takeaway from the F57 data.
Anti-claims
A few things I will not claim, because limitations matter.
Not a quality prediction model. F57 records correlations. It does not predict whether a specific dispatch will succeed. The correlations are weak-to-moderate (R² values typically 0.15-0.40), not strong enough to gate dispatches on predicted quality.
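For a single parameter against a single outcome, R² is the squared Pearson correlation. A minimal standard-library sketch of that calculation (the function name is illustrative, and this is not necessarily how the tracker's aggregator computes it):

```python
def r_squared(xs: list[float], ys: list[float]) -> float:
    """Squared Pearson correlation between two equally long samples."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("a constant sample has no defined correlation")
    return (sxy * sxy) / (sxx * syy)
```

An R² of 0.15-0.40 means a parameter accounts for 15-40% of the variance in the outcome, which is why these numbers support tuning but not gating.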
Not statistical significance at scale. A few thousand dispatches is enough to see patterns. It is not enough for tight confidence intervals on every variable. Larger samples are needed for some of the time-of-day and model-version analyses.
Not a substitute for human review. The quality_score is a heuristic. It does not catch design issues, business-logic mistakes, or architectural concerns. Track + correlate is for operational tuning, not for replacing review.
What I do with the data
Three concrete uses:
- Adjust skills. When the data shows a skill consistently has a rework rate 30% above average, the skill prompt gets tightened. F57 turns "I think this skill is bad" into "the data says it has 2.4x median rework rate, here is a tightened version."
- Pre-dispatch policy. Worktree-dirty gate, file-scope minimums, instruction-length warnings, all derived from the data.
- Postmortem context. When a dispatch goes wrong, F57 gives me the parameter shape that produced the failure. Compare to the average for similar dispatches. Often surfaces the variable that mattered.
The tracker is open source. The receipts are on disk. Anyone running VNX can replicate the analysis on their own data.
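Replicating the analysis starts with reading the receipts back in. A minimal loader sketch, assuming the receipts are stored as NDJSON (one JSON object per line); the file location is up to your installation and is not specified here, and the function name is illustrative.

```python
import json
from pathlib import Path


def load_receipts(path: str) -> list[dict]:
    """Read an NDJSON receipt file into a list of dictionaries."""
    receipts = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                receipts.append(json.loads(line))
            except json.JSONDecodeError:
                # Skip malformed lines rather than abort the whole analysis.
                continue
    return receipts
```

Skipping malformed lines is a deliberate choice for exploratory work; a stricter variant would count and report them.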
What's next
Three roadmap items for the tracker:
- Multi-dispatch DAGs, when a single user-task expands into 5 dispatches in sequence, track the DAG shape not just individual nodes. Current tracker is per-dispatch only.
- Provider-specific weighting, Codex 5.2 vs 5.4 had different output schemas. The tracker should weight historical data by model-version.
- Causal inference experiments, when I have time, A/B test specific changes (e.g., minimum tool set vs full tool set on the same task) and confirm the correlations as causal.
Each is its own project. The current state is "useful for tuning, not yet predictive enough to gate."
Update: June 2026
VNX reached 1.0 code-freeze in early June. The F57 tracker still runs on manual jq queries, but the schema has stabilized. Lane and provider are now first-class parameters, driven by the shift to a six-lane architecture and the June 15 Anthropic billing change. The tmux-spawn lane is becoming the default. The tracker already caught one real idle-worker anomaly through parameter-outcome mismatch. The dashboard is still on the roadmap.
Want to apply the same pattern to your AI orchestration? The VNX repo is open source. The receipt schema and aggregator scripts are reusable. Or connect on LinkedIn for the build-in-public updates.
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.