Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done

The most dangerous moment in a multi-agent workflow is when an agent says "done."

Not because the agent is lying. Not because it's being careless. But because self-assessment is not verification. An LLM can tell you with perfect confidence that it has completed a task, written clean code, added tests, and solved the problem—while simultaneously missing an entire category of requirements, shipping code with side effects, or creating technical debt that compounds for months.

In the systems I've built, I've learned the hard way: if you let agents declare themselves finished, you will deploy incomplete work. And worse, you'll create a false sense of security that the work is complete because the human signed off on what the agent claimed.

This is Part 8 of the Glass Box Governance series. In the previous parts, we built the foundation: receipts instead of chat logs, cascade prevention through evidence layers, NDJSON ledgers for auditability, and external watchers that observe without controlling. Today, we're adding the gate that actually stops bad work from reaching production: the async quality pipeline that makes closure a system decision, not an agent decision.

The "Agent Says Done" Problem

Here's a typical workflow in most AI-driven systems:

  1. Agent receives task
  2. Agent works on task
  3. Agent declares task complete
  4. Human glances at result
  5. Work ships

The human is almost always looking for obvious breakage: "Did the file get created? Does it run? Are there huge red error messages?" What they almost never have time for is: "Does this code follow our patterns? Are there edge cases? Does this match the architecture? Are we creating technical debt?"

And the agent can't catch these things either, because an LLM doesn't have true understanding of system-wide constraints. It can pattern-match against style guides and architectural documents. But it can't verify that the work fits into the larger system without external validation.

Across the 2,472 dispatches I've run through the VNX system, patterns emerged. Agents would:

  • Write functions that were 847 lines long (over blocker threshold) without flagging it
  • Create files with missing error handling in paths they couldn't see
  • Solve a problem in a way that worked locally but broke the integration test suite
  • Claim to have added tests when they'd only added test skeletons

Each of these was technically "done" from the agent's perspective. Each would have shipped without our async quality pipeline.

The Quality Advisory Pipeline

My quality advisory pipeline assigns a risk score (0–100) to every dispatch. Below 30: auto-approve. Between 30 and 50: careful review. Above 50: hold for manual inspection. Above 80: block entirely. These thresholds are advisory inputs, not the whole decision: the final verdict also weighs the findings themselves, which is why a dispatch can score above 50 and still close as approve_with_followup when no blockers are present.

The solution is automated, evidence-based closure: a system that generates quality analysis automatically on every completion, attaches it as a sidecar to the receipt, and uses structured findings to decide whether work truly closes or whether it needs T0 review.

Here's how it works in VNX. When a worker (T1, T2, or T3) completes a dispatch, the append_receipt.py script runs before the receipt is even stored. This script:

  1. Analyzes deliverables — reads the files that were supposedly created or modified
  2. Runs quality checks — file size, complexity metrics, pattern matching against the quality_intelligence.db
  3. Generates findings — structured data about what was good, what's risky, what's broken
  4. Attaches a sidecar — JSON metadata appended to the receipt that the receipt processor uses to decide closure
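The four steps above can be sketched in Python. This is a minimal illustration, not the actual append_receipt.py — the helper logic is hypothetical, and only the 500-line warning and 800-line blocker thresholds come from this post:

```python
from pathlib import Path

# Hypothetical sketch of the sidecar steps; not the real append_receipt.py.
def generate_sidecar(deliverables):
    findings = []
    for path in deliverables:
        # Step 1: analyze deliverables — read the files that were supposedly produced
        text = Path(path).read_text() if Path(path).exists() else ""
        lines = text.count("\n") + 1 if text else 0
        # Step 2: run quality checks — file size against warn/blocker thresholds
        if lines > 800:
            findings.append({"severity": "blocker", "file": path, "category": "file_size",
                             "message": f"{lines} lines exceeds blocker threshold (800)"})
        elif lines > 500:
            findings.append({"severity": "warn", "file": path, "category": "file_size",
                             "message": f"{lines} lines exceeds warning threshold (500)"})
    # Steps 3-4: fold structured findings into a decision and return sidecar metadata
    blockers = [f for f in findings if f["severity"] == "blocker"]
    warnings = [f for f in findings if f["severity"] == "warn"]
    decision = "hold" if blockers else ("approve_with_followup" if warnings else "approve")
    return {"decision": decision, "findings": findings,
            "open_items": {"blockers": blockers, "warnings": len(warnings), "deferred": 0}}
```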

The sidecar looks like this:

```json
{
  "decision": "approve_with_followup",
  "risk_score": 72,
  "quality_checks": {
    "file_analysis": "pass",
    "complexity_metrics": "pass",
    "pattern_match": "warn",
    "test_coverage": "warn"
  },
  "findings": [
    {
      "severity": "warn",
      "file": "src/components/BlogEditor.tsx",
      "category": "file_size",
      "message": "Component is 723 lines. Recommended max is 500 lines (warning) or 800 lines (blocker). Consider splitting into subcomponents.",
      "line_range": [1, 723]
    },
    {
      "severity": "warn",
      "file": "src/lib/api/strapi.ts",
      "category": "error_handling",
      "message": "API client has 3 unhandled promise rejections in batch operations. Could cause silent failures.",
      "line_range": [145, 167]
    },
    {
      "severity": "info",
      "file": "tests/BlogEditor.test.tsx",
      "category": "test_coverage",
      "message": "Test file exists but covers only 64% of component. Edge case coverage needed for conditional rendering.",
      "suggestions": ["Add tests for mobile viewport", "Add tests for error states", "Add tests for async loading states"]
    }
  ],
  "open_items": {
    "blockers": [],
    "warnings": 2,
    "deferred": 1
  }
}
```

Notice the structure: the decision is separate from the findings. The system isn't saying "this is bad work." It's saying "this work has these characteristics, and here's what you need to know to make a decision about whether it closes."
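One thing this separation enables: downstream tooling can recompute what the decision *should* be from the findings and flag sidecars where the two drift apart. A hypothetical consistency check, not part of the actual VNX pipeline:

```python
import json

# Hypothetical: recompute the expected decision from the findings alone
def expected_decision(findings):
    severities = {f["severity"] for f in findings}
    if "blocker" in severities:
        return "hold"
    if "warn" in severities:
        return "approve_with_followup"
    return "approve"

sidecar = json.loads('{"decision": "approve_with_followup", "findings": '
                     '[{"severity": "warn"}, {"severity": "info"}]}')
# Flag any sidecar whose stated decision disagrees with its own findings
drift = sidecar["decision"] != expected_decision(sidecar["findings"])
```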

Evidence-Based Closure: The T0 Review Loop

VNX Orchestration Dashboard with quality gates: 63/63 tests, 28 open items, receipt ledger

The receipt processor reads this sidecar and makes an intermediate decision:

  • approve — no blockers, low risk, auto-closes
  • approve_with_followup — warnings exist, but work is functional; T0 reviews the open_items.json and decides
  • hold — blockers detected; work cannot close until T0 explicitly overrides or issues a fix request
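A minimal sketch of that routing step, with assumed field names — the real receipt processor isn't shown in this post:

```python
# Hypothetical routing sketch; field names follow the sidecar example above.
def route(sidecar, dispatch_id):
    decision = sidecar["decision"]
    if decision == "approve":
        # No blockers, low risk: auto-closes with no human involvement
        return {"status": "closed", "dispatch_id": dispatch_id}
    if decision == "approve_with_followup":
        # Functional but with warnings: T0 reviews open_items.json before closure
        return {"status": "review_needed", "dispatch_id": dispatch_id,
                "open_items": sidecar.get("open_items", {})}
    # "hold": blockers present — closure requires explicit T0 override or a fix request
    return {"status": "held", "dispatch_id": dispatch_id,
            "blockers": [f for f in sidecar["findings"] if f["severity"] == "blocker"]}
```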

Here's the key: T0 is the sole authority for declaring work done. Workers attach evidence. The quality pipeline generates findings. But closure happens in T0, with the open_items.json as the working document:

```json
{
  "dispatch_id": "d_20260310_045_001",
  "status": "review_needed",
  "created_at": "2026-03-10T14:32:00Z",
  "track_owner": "T1",
  "deliverables": ["src/components/BlogEditor.tsx", "tests/BlogEditor.test.tsx"],
  "blockers": [],
  "warnings": [
    {
      "id": "warn_file_size_723",
      "file": "src/components/BlogEditor.tsx",
      "message": "723 lines — at warning threshold",
      "mitigation": "T0 decision: refactor into 3 smaller components in follow-up dispatch",
      "status": "deferred"
    },
    {
      "id": "warn_error_handling_batch",
      "file": "src/lib/api/strapi.ts",
      "message": "3 unhandled rejections in batch operations",
      "mitigation": "Fix merged into dispatch; re-verified by T2",
      "status": "closed"
    }
  ],
  "deferred_work": [
    {
      "id": "defer_component_refactor",
      "description": "Split BlogEditor into smaller components",
      "reason": "Improves maintainability; not critical for current PR",
      "created_dispatch": "d_20260311_046_001"
    }
  ],
  "t0_notes": "Approved with deferral. Component works correctly. Size warning is valid; scheduled refactor for next sprint. Test coverage is adequate for current scope.",
  "t0_signed_off_at": "2026-03-10T15:14:00Z"
}
```

T0 doesn't say "yes" or "no" on a whim. T0 reviews the quality findings, reads the worker's implementation, and makes explicit decisions about what closes now and what becomes a follow-up dispatch. That deferral becomes a proper task in the queue, dependency-aware and tracked.

This is how work actually stays small, focussed, and high-quality. T0 prevents the accumulation of "we'll fix that later" debt by turning it into explicit, scheduled work.
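Minting a follow-up dispatch from a deferral can be sketched like this. The dispatch-ID format mirrors the examples above; the queue API is assumed:

```python
from datetime import date

# Hypothetical sketch: turn a deferred item into a tracked, dependency-aware dispatch
def defer_to_dispatch(deferral, queue, parent_id, seq):
    # Mint an ID in the d_YYYYMMDD_run_seq style used in the examples above
    new_id = f"d_{date.today():%Y%m%d}_{seq:03d}_001"
    dispatch = {
        "dispatch_id": new_id,
        "description": deferral["description"],
        "reason": deferral["reason"],
        "depends_on": [parent_id],  # dependency-aware: tracked against the parent dispatch
    }
    queue.append(dispatch)
    deferral["created_dispatch"] = new_id  # back-reference recorded in open_items.json
    return new_id
```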

Real Quality Sidecars from 2,472 Dispatches

Let me give you examples from actual VNX runs. These aren't fabricated—they're patterns I see across the system:

Dispatch: Add dark mode toggle to Settings

```yaml
risk_score: 34
decision: approve
findings:
  - severity: info
    category: pattern_match
    message: "Matches 3 stored patterns for React theme switching (95% similarity).
    Low risk of deviation from established architecture."
  - severity: info
    category: test_coverage
    message: "82% coverage. Good coverage of light/dark mode transitions and persistence."
```

This closes automatically. Low risk, matches known-good patterns, high test coverage.


Dispatch: Migrate blog post API to new Strapi schema

```yaml
risk_score: 78
decision: hold
findings:
  - severity: blocker
    category: integration_test
    message: "Dispatch modifies 5 files in src/lib/api/strapi.ts but 2 integration tests
    fail: test_blogPostWithDeepPopulate and test_nestedCategoryFiltering. These must pass
    before closure."
  - severity: warn
    category: backwards_compatibility
    message: "Old API signature (getBlogPost(id)) is removed. 3 callsites in src/pages/
    still use old signature. Add deprecation period or fix callsites."
  - severity: warn
    category: complexity
    message: "New query construction in strapiClient has 8 nested conditionals.
    Recommend extracting into helper function for readability."
```

This one blocks. The blocker (failing tests) is non-negotiable. T0 reads the report, asks T2 to fix the tests, and the dispatch re-enters validation. The warnings are flagged but can be addressed in follow-up work if T0 approves.


Dispatch: Write blog post about AI governance

```yaml
risk_score: 12
decision: approve
findings:
  - severity: info
    category: seo_metadata
    message: "Metadata complete: title, description, OG tags.
    Keywords match top-10 search terms from analytics."
  - severity: info
    category: content_structure
    message: "H1: 1, H2: 4, H3: 8. Good structure for readability."
  - severity: info
    category: markdown_validity
    message: "All links valid, no broken references, images all exist."
```

This closes automatically. Content work that meets standards, proper metadata, no structural issues.

Gate Progression and Quality Compounding

Where the async quality system becomes powerful is in how it compounds across gates. Work doesn't just pass once. It passes through multiple quality layers, each configured in YAML:

```yaml
gates:
  planning:
    primary_agent: analyst
    template: templates/agents/analyst.md
  implementation:
    primary_agent: developer
    snippets: [testing.md]
  review:
    primary_agent: senior-developer
    cognition: deep  # Requires Opus subagent
  validation:
    primary_agent: architect-opus
    cognition: deep
```

  1. Planning Gate — T0 decides dispatch is well-defined
  2. Implementation Gate — T1 completes work; quality sidecar generated; T0 reviews open_items
  3. Review Gate — T3 performs code review; flags patterns that weren't caught; findings added to open_items
  4. Testing Gate — T2 runs integration tests; failures create blockers; work loops back if needed
  5. Validation Gate — Receipt processor verifies all open items closed; T0 gives final approval
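The progression through the five gates can be sketched as a loop that refuses to advance while any gate reports open items. Gate names come from the list above; the per-gate check functions are hypothetical:

```python
GATES = ["planning", "implementation", "review", "testing", "validation"]

# checks: gate name -> function returning a list of open items (empty list = pass)
def progress(dispatch, checks):
    for gate in GATES:
        open_items = checks[gate](dispatch)
        if open_items:
            # Work loops back: the dispatch cannot advance past this gate
            return {"stalled_at": gate, "open_items": open_items}
    # All five checkpoints passed with evidence; eligible for final T0 approval
    return {"stalled_at": None, "open_items": []}
```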

By the time work reaches production, it's been validated at 5 different quality checkpoints, each with evidence. An agent claiming something is done doesn't matter. The system verified it.

And here's what I didn't expect: developers get faster because they know exactly what needs to be done. Instead of vague feedback ("this needs work"), they get structured findings. Instead of back-and-forth, they get a single next-step: "Fix the 3 unhandled rejections, re-run tests, dispatch closes."

The Pattern Learning Loop: 1,143 Stored Patterns

The real magic is that every dispatch teaches the system. The quality_intelligence.db now contains 1,143 patterns—examples of good solutions, common mistakes, architectural gotchas—indexed with full-text search.

When a new dispatch enters the quality pipeline, the analyzer runs:

```sql
SELECT pattern_id, similarity, category FROM patterns
WHERE category IN ('file_size', 'error_handling', 'test_coverage')
AND similarity > 0.85
ORDER BY frequency DESC
```

If your new API client matches 92% of a stored pattern for "database migration," the system knows what tends to go wrong in that category. It generates findings based on historical evidence, not just static rules. The 1,143 patterns aren't a checklist; they're a distributed memory of what the system has learned.
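Against a local SQLite store, that lookup looks roughly like this. The schema and sample rows are illustrative, not the actual quality_intelligence.db layout, and the real database's full-text indexing is omitted here:

```python
import sqlite3

# Illustrative schema — not the real quality_intelligence.db
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE patterns (
    pattern_id TEXT, category TEXT, similarity REAL, frequency INTEGER)""")
db.executemany("INSERT INTO patterns VALUES (?, ?, ?, ?)", [
    ("p_oversized_component", "file_size", 0.88, 29),
    ("p_batch_rejections", "error_handling", 0.91, 17),
    ("p_theme_switch", "pattern_match", 0.95, 41),  # excluded: category not queried
    ("p_low_similarity", "error_handling", 0.40, 99),  # excluded: below 0.85
])

# The query from the post, run as-is
rows = db.execute("""
    SELECT pattern_id, similarity, category FROM patterns
    WHERE category IN ('file_size', 'error_handling', 'test_coverage')
      AND similarity > 0.85
    ORDER BY frequency DESC
""").fetchall()
```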

This is how the quality gates improve over time. As more work flows through, the pattern database grows. The findings become more precise. False positives decrease. T0 has to review fewer edge cases because the system already understands them.

Where This Breaks Down

I need to be honest about the limits.

Async quality gates catch structural problems, not semantic ones. If your API client is well-written, well-tested, properly sized, and follows patterns—but solves the wrong problem—the system won't catch that. That's what T0 and T3 are for. The quality pipeline is an automated check for how the work was done, not whether it was the right work to do.

The sidecar is only as good as the rules. If I set the file size threshold at 800 lines but your codebase genuinely needs 950-line files, the system will generate false warnings. I've learned that quality thresholds need to be context-aware, and I don't have that built yet.

Self-contained quality metrics miss integration problems. A function can be beautifully written but create a bottleneck when run at scale, or interact badly with another subsystem. The async quality pipeline doesn't run your full system under load. It runs static and unit-level checks.

Patterns can encode bad practices. If 1,043 of your 1,143 patterns are solutions from six months ago, and you refactor your architecture, those patterns become anti-patterns. The system needs active curation. I've had to manually purge bad patterns twice now.

And most importantly: a quality gate is not a replacement for testing or review. It's a precondition. It catches the obvious stuff, escalates the risky stuff, and prevents the definitely-broken stuff from reaching T0. But it doesn't replace human judgment.

Tying It Together: The Full Governance Stack

We've now built something complete. Let's trace a dispatch from creation to closure:

  1. T0 creates the dispatch (Part 2: Glass Box Governance). Receipt created, no chat logs.
  2. T1 works on it (Part 6: External Watcher). Observer watches without controlling; T1 writes files.
  3. Completion triggers quality sidecar (Part 8: Today). append_receipt.py analyzes deliverables, generates findings.
  4. Receipt processor reads sidecar (Part 4: NDJSON Ledger). Decides approve / approve_with_followup / hold based on decision field.
  5. T0 reviews open_items.json (Part 8: Today). If warnings exist, T0 explicitly closes them (deferred, fixed, or re-assessed).
  6. Cascade prevention kicks in (Part 3: Cascade of Doom). Blocker findings create explicit next-steps, not silent failures.
  7. Audit trail is complete (Part 4: NDJSON Ledger). Every decision, every finding, every T0 sign-off is in the ledger.

By the time work reaches production, you don't just have a human's opinion that it's good. You have:

  • Evidence from automated quality analysis
  • Findings from pattern matching across 1,143 stored examples
  • Integration test results
  • Code review from a dedicated reviewer
  • Explicit T0 sign-off with reasoning
  • A full audit trail if something breaks

This is the difference between "shipping code that an agent finished" and "deploying work that the system verified."

What's Next

I'm currently working on two things:

  1. Context-aware thresholds — File size limits that vary by file type. API clients and UI components have different complexity profiles.
  2. Cross-dispatch impact analysis — When a dispatch modifies a dependency, automatically flag all downstream code that might be affected. Reduce cascade risk further.

The Glass Box Governance system isn't perfect, but it's been tested on 2,472 real dispatches. It scales. It catches real problems. And maybe most importantly, it removes the burden of closure from the agent and puts it where it belongs: in the system.

Because an agent should never decide when it's done. The system should.

The full VNX orchestration system — including quality gates, receipt ledger, and dispatch pipeline — is open source on GitHub.


This post is part of the Glass Box Governance series.

Previous: Why Architecture Beats Models — After 2,400+ dispatches, why does your framework choice matter less than you think?

Next: From Human-in-the-Loop to Human-on-the-Loop — How do you graduate from approving every agent action to monitoring outcomes?


📚 Glass Box Governance series

  1. One Terminal to Rule Them All: How I Orchestrate Claude, Codex, and Gemini Without Them Knowing About Each Other
  2. Receipts, Not Chat Logs: What 2,472 AI Agent Dispatches Taught Me About Governance
  3. The Cascade of Doom: When AI Agents Hallucinate in Chains
  4. Why I Chose NDJSON Over Postgres for My AI Agent Audit Trail
  5. Claude Agent Teams vs. Building Your Own: What Anthropic Solved (And What They Left Out)
  6. The External Watcher Pattern: How I Observe AI Agents Without Trusting Their Self-Reports
  7. Why Architecture Beats Models: Lessons from 2400+ AI Agent Dispatches
  8. Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done ← you are here
  9. From Human-in-the-Loop to Human-on-the-Loop: A Production Graduation Path
  10. Traceability as Architecture: Designing AI Systems Where Every Decision Has a Receipt
  11. Decision-Making Architecture: Why Autonomous Agents Need Governance, Not Just Instructions
  12. Context Rotation at Scale: How VNX Keeps AI Agents Honest After 10,000 Dispatches
  13. Autonomous Agent Patterns: 5 Production-Tested Approaches for Agents That Run Without You
  14. Governance Scoring: How to Measure Whether Your AI Agent Deserves More Autonomy

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.
