The most dangerous moment in a multi-agent workflow is when an agent says "done."
Not because the agent is lying. Not because it's being careless. But because self-assessment is not verification. An LLM can tell you with perfect confidence that it has completed a task, written clean code, added tests, and solved the problem—while simultaneously missing an entire category of requirements, shipping code with side effects, or creating technical debt that compounds for months.
In the systems I've built, I've learned the hard way: if you let agents declare themselves finished, you will deploy incomplete work. And worse, you'll create a false sense of security that the work is complete because the human signed off on what the agent claimed.
This is Part 7 of the Glass Box Governance series. In the previous parts, we built the foundation: receipts instead of chat logs, cascade prevention through evidence layers, NDJSON ledgers for auditability, and external watchers that observe without controlling. Today, we're adding the gate that actually stops bad work from reaching production: the async quality pipeline that makes closure a system decision, not an agent decision.
The "Agent Says Done" Problem
Here's a typical workflow in most AI-driven systems:
- Agent receives task
- Agent works on task
- Agent declares task complete
- Human glances at result
- Work ships
The human is almost always looking for obvious breakage: "Did the file get created? Does it run? Are there huge red error messages?" What they almost never have time for is: "Does this code follow our patterns? Are there edge cases? Does this match the architecture? Are we creating technical debt?"
And the agent can't catch these things either, because an LLM doesn't have true understanding of system-wide constraints. It can pattern-match against style guides and architectural documents. But it can't verify that the work fits into the larger system without external validation.
After running 2,472 dispatches through the VNX system, patterns emerged. Agents would:
- Write functions that were 847 lines long (over the blocker threshold) without flagging it
- Create files with missing error handling in paths they couldn't see
- Solve a problem in a way that worked locally but broke the integration test suite
- Claim to have added tests when they'd only added test skeletons
Each of these was technically "done" from the agent's perspective. Each would have shipped without our async quality pipeline.
The Quality Advisory Pipeline
My quality advisory pipeline assigns every dispatch a risk score from 0 to 100. The score is a triage signal, not a verdict: low scores with clean findings auto-approve, scores above roughly 80 block outright, and everything in between turns on the findings themselves. Warnings route to T0 review; blockers hold the work.
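That score-to-decision banding can be sketched as a small function. The thresholds and decision names here mirror the examples later in this post, but the function is illustrative, not the VNX implementation:

```python
def triage(risk_score: int, blockers: int, warnings: int) -> str:
    """Map a 0-100 risk score plus finding counts to an intermediate decision.

    Thresholds are illustrative; blocker findings override the score either way.
    """
    if blockers > 0 or risk_score >= 80:
        return "hold"                    # non-negotiable: T0 must intervene
    if warnings > 0:
        return "approve_with_followup"   # functional, but T0 reviews open items
    if risk_score < 50:
        return "approve"                 # low risk, clean findings: auto-close
    return "hold"                        # high score with no findings: inspect anyway
```

Run against the three example dispatches below, this gives approve for the dark mode toggle, approve_with_followup for the BlogEditor sidecar, and hold for the failing Strapi migration.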
The solution is automated, evidence-based closure: a system that generates quality analysis automatically on every completion, attaches it as a sidecar to the receipt, and uses structured findings to decide whether work truly closes or whether it needs T0 review.
Here's how it works in VNX. When a worker (T1, T2, or T3) completes a dispatch, the append_receipt.py script runs before the receipt is even stored. This script:
- Analyzes deliverables — reads the files that were supposedly created or modified
- Runs quality checks — file size, complexity metrics, pattern matching against the quality_intelligence.db
- Generates findings — structured data about what was good, what's risky, what's broken
- Attaches a sidecar — JSON metadata appended to the receipt that the receipt processor uses to decide closure
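The four steps above can be sketched in miniature. This is a deliberately simplified stand-in for `append_receipt.py`: it only implements the file-size check, and the real pipeline also runs complexity metrics and pattern matching against `quality_intelligence.db`:

```python
from pathlib import Path

def build_sidecar(deliverables: list[str], max_lines: int = 800) -> dict:
    """Analyze deliverables and emit structured findings plus a decision.

    Illustrative sketch only: the real script runs several check families,
    not just file size.
    """
    findings = []
    for path in deliverables:
        text = Path(path).read_text(encoding="utf-8")
        n = len(text.splitlines())
        if n > max_lines:
            findings.append({
                "severity": "blocker", "file": path, "category": "file_size",
                "message": f"{n} lines exceeds the {max_lines}-line blocker threshold.",
            })
        elif n > max_lines * 0.6:  # warning band below the hard limit
            findings.append({
                "severity": "warn", "file": path, "category": "file_size",
                "message": f"{n} lines, approaching the {max_lines}-line limit.",
            })
    blockers = sum(f["severity"] == "blocker" for f in findings)
    decision = "hold" if blockers else (
        "approve_with_followup" if findings else "approve")
    return {"decision": decision, "findings": findings}
```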
The sidecar looks like this:
{
"decision": "approve_with_followup",
"risk_score": 72,
"quality_checks": {
"file_analysis": "pass",
"complexity_metrics": "pass",
"pattern_match": "warn",
"test_coverage": "warn"
},
"findings": [
{
"severity": "warn",
"file": "src/components/BlogEditor.tsx",
"category": "file_size",
"message": "Component is 723 lines. Recommended max is 500 lines (warning) or 800 lines (blocker). Consider splitting into subcomponents.",
"line_range": [1, 723]
},
{
"severity": "warn",
"file": "src/lib/api/strapi.ts",
"category": "error_handling",
"message": "API client has 3 unhandled promise rejections in batch operations. Could cause silent failures.",
"line_range": [145, 167]
},
{
"severity": "info",
"file": "tests/BlogEditor.test.tsx",
"category": "test_coverage",
"message": "Test file exists but covers only 64% of component. Edge case coverage needed for conditional rendering.",
"suggestions": ["Add tests for mobile viewport", "Add tests for error states", "Add tests for async loading states"]
}
],
"open_items": {
"blockers": [],
"warnings": 2,
"deferred": 1
}
}

Notice the structure: the decision is separate from the findings. The system isn't saying "this is bad work." It's saying "this work has these characteristics, and here's what you need to know to make a decision about whether it closes."
Evidence-Based Closure: The T0 Review Loop

The receipt processor reads this sidecar and makes an intermediate decision:
- approve — no blockers, low risk, auto-closes
- approve_with_followup — warnings exist, but work is functional; T0 reviews the open_items.json and decides
- hold — blockers detected; work cannot close until T0 explicitly overrides or issues a fix request
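That routing step is simple enough to sketch. The decision names come from the sidecar format above; the closure states are illustrative, not the exact strings the VNX receipt processor uses:

```python
def route(sidecar: dict) -> str:
    """Map a quality sidecar's decision to a closure state.

    Only 'approve' closes without a human; everything else waits on T0.
    """
    decision = sidecar.get("decision")
    if decision == "approve":
        return "closed"            # auto-closes, receipt finalized
    if decision == "approve_with_followup":
        return "review_needed"     # open_items.json goes to T0
    return "held"                  # blockers: T0 must override or request a fix
```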
Here's the key: T0 is the sole authority for declaring work done. Workers attach evidence. The quality pipeline generates findings. But closure happens in T0, with the open_items.json as the working document:
{
"dispatch_id": "d_20260310_045_001",
"status": "review_needed",
"created_at": "2026-03-10T14:32:00Z",
"track_owner": "T1",
"deliverables": ["src/components/BlogEditor.tsx", "tests/BlogEditor.test.tsx"],
"blockers": [],
"warnings": [
{
"id": "warn_file_size_723",
"file": "src/components/BlogEditor.tsx",
"message": "723 lines — at warning threshold",
"mitigation": "T0 decision: refactor into 3 smaller components in follow-up dispatch",
"status": "deferred"
},
{
"id": "warn_error_handling_batch",
"file": "src/lib/api/strapi.ts",
"message": "3 unhandled rejections in batch operations",
"mitigation": "Fix merged into dispatch; re-verified by T2",
"status": "closed"
}
],
"deferred_work": [
{
"id": "defer_component_refactor",
"description": "Split BlogEditor into smaller components",
"reason": "Improves maintainability; not critical for current PR",
"created_dispatch": "d_20260311_046_001"
}
],
"t0_notes": "Approved with deferral. Component works correctly. Size warning is valid; scheduled refactor for next sprint. Test coverage is adequate for current scope.",
"t0_signed_off_at": "2026-03-10T15:14:00Z"
}

T0 doesn't say "yes" or "no" on a whim. T0 reviews the quality findings, reads the worker's implementation, and makes explicit decisions about what closes now and what becomes a follow-up dispatch. That deferral becomes a proper task in the queue, dependency-aware and tracked.
This is how work actually stays small, focused, and high-quality. T0 prevents the accumulation of "we'll fix that later" debt by turning it into explicit, scheduled work.
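The deferral step in particular is mechanical enough to sketch. Every open item T0 marks as deferred becomes a queued dispatch record; the field names follow the open_items.json example above, and the `d_YYYYMMDD_seq_nnn` ID scheme is inferred from the IDs in this post rather than taken from the VNX source:

```python
from datetime import date

def defer_to_dispatch(item: dict, seq: int, track: str = "T1") -> dict:
    """Turn a deferred open item into a queued follow-up dispatch record.

    Field names and ID scheme mirror the examples in this post; the exact
    VNX schema may differ.
    """
    return {
        "dispatch_id": f"d_{date.today():%Y%m%d}_{seq:03d}_001",
        "track_owner": track,
        "description": item["description"],
        "origin_item": item["id"],   # back-link for traceability
        "status": "queued",
    }
```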
Real Quality Sidecars from 2,472 Dispatches
Let me give you examples from actual VNX runs. These aren't fabricated—they're patterns I see across the system:
Dispatch: Add dark mode toggle to Settings
risk_score: 34
decision: approve
findings:
- severity: info
category: pattern_match
message: "Matches 3 stored patterns for React theme switching (95% similarity).
Low risk of deviation from established architecture."
- severity: info
category: test_coverage
message: "82% coverage. Good coverage of light/dark mode transitions and persistence."

This closes automatically. Low risk, matches known-good patterns, high test coverage.
Dispatch: Migrate blog post API to new Strapi schema
risk_score: 78
decision: hold
findings:
- severity: blocker
category: integration_test
message: "Dispatch modifies 5 files, including src/lib/api/strapi.ts, but 2 integration tests
fail: test_blogPostWithDeepPopulate and test_nestedCategoryFiltering. These must pass
before closure."
- severity: warn
category: backwards_compatibility
message: "Old API signature (getBlogPost(id)) is removed. 3 callsites in src/pages/
still use old signature. Add deprecation period or fix callsites."
- severity: warn
category: complexity
message: "New query construction in strapiClient has 8 nested conditionals.
Recommend extracting into helper function for readability."

This one blocks. The blocker (failing tests) is non-negotiable. T0 reads the report, asks T2 to fix the tests, and the dispatch re-enters validation. The warnings are flagged but can be addressed in follow-up work if T0 approves.
Dispatch: Write blog post about AI governance
risk_score: 12
decision: approve
findings:
- severity: info
category: seo_metadata
message: "Metadata complete: title, description, OG tags.
Keywords match top-10 search terms from analytics."
- severity: info
category: content_structure
message: "H1: 1, H2: 4, H3: 8. Good structure for readability."
- severity: info
category: markdown_validity
message: "All links valid, no broken references, images all exist."

This closes automatically. Content work that meets standards, proper metadata, no structural issues.
Gate Progression and Quality Compounding
Where the async quality system becomes powerful is in how it compounds across gates. Work doesn't just pass once. It passes through multiple quality layers, each configured in YAML:
gates:
  planning:
    primary_agent: analyst
    template: templates/agents/analyst.md
  implementation:
    primary_agent: developer
    snippets: [testing.md]
  review:
    primary_agent: senior-developer
    cognition: deep  # Requires Opus subagent
  validation:
    primary_agent: architect-opus
    cognition: deep

- Planning Gate — T0 decides dispatch is well-defined
- Implementation Gate — T1 completes work; quality sidecar generated; T0 reviews open_items
- Review Gate — T3 performs code review; flags patterns that weren't caught; findings added to open_items
- Testing Gate — T2 runs integration tests; failures create blockers; work loops back if needed
- Validation Gate — Receipt processor verifies all open items closed; T0 gives final approval
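The progression rule behind these checkpoints is simple and worth stating as code. This sketch assumes a linear gate sequence and a blocker count per gate; the real VNX config is richer (per-gate agents, cognition levels), but the advance-or-loop-back logic is the core idea:

```python
GATES = ["planning", "implementation", "review", "testing", "validation"]

def advance(current: str, open_blockers: int) -> str:
    """Move work to the next gate only when the current gate is clean.

    Illustrative: blockers hold the dispatch in place; a clean validation
    gate closes it.
    """
    if open_blockers > 0:
        return current            # blockers loop the work back for fixes
    i = GATES.index(current)
    return GATES[i + 1] if i + 1 < len(GATES) else "closed"
```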
By the time work reaches production, it's been validated at 5 different quality checkpoints, each with evidence. An agent claiming something is done doesn't matter. The system verified it.
And here's what I didn't expect: developers get faster because they know exactly what needs to be done. Instead of vague feedback ("this needs work"), they get structured findings. Instead of back-and-forth, they get a single next-step: "Fix the 3 unhandled rejections, re-run tests, dispatch closes."
The Pattern Learning Loop: 1,143 Stored Patterns
The real magic is that every dispatch teaches the system. The quality_intelligence.db now contains 1,143 patterns—examples of good solutions, common mistakes, architectural gotchas—indexed with full-text search.
When a new dispatch enters the quality pipeline, the analyzer runs:
SELECT pattern_id, similarity, category FROM patterns
WHERE category IN ('file_size', 'error_handling', 'test_coverage')
AND similarity > 0.85
ORDER BY frequency DESC

If your new API client matches 92% of a stored pattern for "database migration," the system knows what tends to go wrong in that category. It generates findings based on historical evidence, not just static rules. The 1,143 patterns aren't a checklist; they're a distributed memory of what the system has learned.
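Under the hood this is an ordinary SQLite lookup. A minimal sketch, with the schema assumed from the query above and similarity scoring left to the caller (in practice the similarity of a new dispatch against stored patterns has to be computed in application code, not in SQL):

```python
import sqlite3

def matching_patterns(conn: sqlite3.Connection, categories: list[str],
                      min_freq: int = 2) -> list[tuple]:
    """Fetch stored patterns in the given categories, most frequent first.

    Schema is assumed from the query in this post; similarity filtering
    happens outside this function.
    """
    marks = ",".join("?" * len(categories))  # one placeholder per category
    return conn.execute(
        f"SELECT pattern_id, category, frequency FROM patterns "
        f"WHERE category IN ({marks}) AND frequency >= ? "
        f"ORDER BY frequency DESC",
        (*categories, min_freq),
    ).fetchall()
```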
This is how the quality gates improve over time. As more work flows through, the pattern database grows. The findings become more precise. False positives decrease. T0 has to review fewer edge cases because the system already understands them.
Where This Breaks Down
I need to be honest about the limits.
Async quality gates catch structural problems, not semantic ones. If your API client is well-written, well-tested, properly sized, and follows patterns—but solves the wrong problem—the system won't catch that. That's what T0 and T3 are for. The quality pipeline is an automated check for how the work was done, not whether it was the right work to do.
The sidecar is only as good as the rules. If I set the file size threshold at 800 lines but your codebase genuinely needs 950-line files, the system will generate false warnings. I've learned that quality thresholds need to be context-aware, and I don't have that built yet.
Self-contained quality metrics miss integration problems. A function can be beautifully written but create a bottleneck when run at scale, or interact badly with another subsystem. The async quality pipeline doesn't run your full system under load. It runs static and unit-level checks.
Patterns can encode bad practices. If 1,043 of your 1,143 patterns are solutions from six months ago, and you refactor your architecture, those patterns become anti-patterns. The system needs active curation. I've had to manually purge bad patterns twice now.
And most importantly: a quality gate is not a replacement for testing or review. It's a precondition. It catches the obvious stuff, escalates the risky stuff, and prevents the definitely-broken stuff from reaching T0. But it doesn't replace human judgment.
Tying It Together: The Full Governance Stack
We've now built something complete. Let's trace a dispatch from creation to closure:
- T0 creates the dispatch (Part 2: Glass Box Governance). Receipt created, no chat logs.
- T1 works on it (Part 6: External Watcher). Observer watches without controlling; T1 writes files.
- Completion triggers quality sidecar (Part 7: Today). append_receipt.py analyzes deliverables, generates findings.
- Receipt processor reads sidecar (Part 4: NDJSON Ledger). Decides approve / approve_with_followup / hold based on the decision field.
- T0 reviews open_items.json (Part 7: Today). If warnings exist, T0 explicitly closes them (deferred, fixed, or re-assessed).
- Cascade prevention kicks in (Part 3: Cascade of Doom). Blocker findings create explicit next-steps, not silent failures.
- Audit trail is complete (Part 4: NDJSON Ledger). Every decision, every finding, every T0 sign-off is in the ledger.
By the time work reaches production, you don't just have a human's opinion that it's good. You have:
- Evidence from automated quality analysis
- Findings from pattern matching across 1,143 stored examples
- Integration test results
- Code review from a dedicated reviewer
- Explicit T0 sign-off with reasoning
- A full audit trail if something breaks
This is the difference between "shipping code that an agent finished" and "deploying work that the system verified."
What's Next
I'm currently working on two things:
- Context-aware thresholds — File size limits that vary by file type. API clients and UI components have different complexity profiles.
- Cross-dispatch impact analysis — When a dispatch modifies a dependency, automatically flag all downstream code that might be affected. Reduce cascade risk further.
The Glass Box Governance system isn't perfect, but it's been tested on 2,472 real dispatches. It scales. It catches real problems. And maybe most importantly, it removes the burden of closure from the agent and puts it where it belongs: in the system.
Because an agent should never decide when it's done. The system should.
The full VNX orchestration system — including quality gates, receipt ledger, and dispatch pipeline — is open source on GitHub.
This post is part of the Glass Box Governance series.
Previous: Why Architecture Beats Models — After 2,400+ dispatches, why does your framework choice matter less than you think?
Next: From Human-in-the-Loop to Human-on-the-Loop — How do you graduate from approving every agent action to monitoring outcomes?
📚 Glass Box Governance series
- One Terminal to Rule Them All: How I Orchestrate Claude, Codex, and Gemini Without Them Knowing About Each Other
- Receipts, Not Chat Logs: What 2,472 AI Agent Dispatches Taught Me About Governance
- The Cascade of Doom: When AI Agents Hallucinate in Chains
- Why I Chose NDJSON Over Postgres for My AI Agent Audit Trail
- Claude Agent Teams vs. Building Your Own: What Anthropic Solved (And What They Left Out)
- The External Watcher Pattern: How I Observe AI Agents Without Trusting Their Self-Reports
- Why Architecture Beats Models: Lessons from 2400+ AI Agent Dispatches
- Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done ← you are here
- From Human-in-the-Loop to Human-on-the-Loop: A Production Graduation Path
- Traceability as Architecture: Designing AI Systems Where Every Decision Has a Receipt
- Decision-Making Architecture: Why Autonomous Agents Need Governance, Not Just Instructions
- Context Rotation at Scale: How VNX Keeps AI Agents Honest After 10,000 Dispatches
- Autonomous Agent Patterns: 5 Production-Tested Approaches for Agents That Run Without You
- Governance Scoring: How to Measure Whether Your AI Agent Deserves More Autonomy
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.