AI Quality Gates: Beyond LLM Code Review

An AI agent that reports "done" is making a claim, not a guarantee. I learned to stop merging claims and start merging verified diffs, and the thing that makes the difference is a set of deterministic ai quality gates that run on every completion.

The popular answer to "how do you check what an agent wrote" is LLM code review: point a second model at the diff and ask if it looks right. I used that for months. It has a blind spot I could not engineer around, so I replaced the guarantee part with code that does not negotiate.

The problem: an LLM judge shares the failure mode it is judging

When an agent says it finished a task, that is a statement about its own work. It is the same model, with the same training and the same blind spots, evaluating the same output. A confident wrong answer reviewed by a confident reviewer is still wrong.

LLM review also lets an agent reason its way around a rule. Give a model a policy in plain language and it can produce a plausible justification for a check it actually failed. The justification reads well. The check is still red.

Then there is reproducibility. Run the same diff through an LLM reviewer twice and you can get two different verdicts. A review that is itself probabilistic cannot give you a deterministic guarantee, and a verdict you cannot reproduce is a verdict you cannot audit.

That is the core issue. A reviewer that shares the failure mode of the thing it reviews adds confidence, not certainty.

AI quality gates: governance as code, not as policy

A deterministic gate is a hardcoded check with no model in the decision. Input goes in, a verdict comes out, and it is the same verdict every time. No interpretation, no negotiation.

The distinction that matters is policy versus gate. A policy is a document that a person or an agent can reinterpret. A gate is code that executes and does not care how persuasive the diff is.

Reproducible means auditable. The same receipt produces the same verdict, so I can append that verdict to a ledger and trust it later. These are the checks I actually run:

Gate	Rule	Why
File size	Python file > 800 lines = BLOCK	Oversized files hide complexity
Function length	Function > 70 lines = BLOCK	Long functions resist testing
Dead code	`vulture` finds unreachable code	Agents leave orphaned branches
Secrets	`gitleaks` scans the diff	A leaked key is non-negotiable
Risk patterns	Regex for `eval(`, `shell=True`, hardcoded tokens	Known-dangerous constructs

These gates do not replace LLM review. They put a deterministic floor under it. I still use a model for taste and architecture, the things that genuinely need judgment. The hard guarantees come from code.

Why it runs async: verification after the receipt, zero agent latency

Every agent completion produces a structured record of what happened, written as one line of NDJSON. That record is the input to the gates. The gates run after the record lands, not inside the agent's hot path, so the agent never waits on them.

This matters more than it sounds. Governance now scales independently of agent speed. I can add a heavier check tomorrow without slowing down a single dispatch, because the check runs on the receipt, not in the loop.

The receipt is the foundation the whole thing stands on. I wrote about that ledger in detail in an AI audit trail without a database. The gates are what read it.

Two-phase verification: light at receipt-time, heavy pre-merge

I run gates in two phases because not every check belongs at every moment.

Phase one is receipt-time. Fast checks fire the instant a completion lands. They are non-fatal: they do not stop the run, they log a verdict and a risk score. Over hundreds of dispatches that logging shows me patterns I would never catch one at a time.

Phase two is pre-merge. Here the checks are heavy and blocking: pytest, AST analysis, artifact verification (did the agent actually write the file it claimed?), shell-syntax validation. Nothing merges until this phase is clean.

Cheap and fast for observability, always on. Expensive and slow for enforcement, only at the moment that counts. The agent pays nothing for the first and waits only when it is about to change something permanent.

Verdict and risk score: GO or HOLD as a contract

A gate does not return "looks good." It returns a binary verdict and a number.

The risk score is a weighted sum of the gate hits. A leaked secret weighs far more than a slightly long file, so the score reflects severity, not just count. Below the threshold the verdict is GO. At or above it, the verdict is HOLD.

A HOLD does not mean the system rejects the work on its own. It means the system stops and escalates to me. The gate blocks, the human decides. That is the line I never cross: deterministic enforcement up to the verdict, human judgment past it.

Flow from agent completion to a receipt, through async gates for size, complexity, dead code, secrets and risk patterns, into a risk score that splits into a GO merge path and a HOLD human-review path — A completion becomes a receipt, the gates score it, and the verdict routes to merge or to me

Here is a gate evaluating a receipt. No model anywhere in the logic, which is the entire point:

python

# deterministic_gate.py
# Runs async, post-ingestion, on a single completion receipt.
# No model call anywhere: same receipt in, same verdict out.

RISK_WEIGHTS = {
    "secret_found":      60,
    "risk_pattern":      25,
    "file_too_long":     15,
    "function_too_long": 10,
    "dead_code":          5,
}
HOLD_THRESHOLD = 40

def evaluate(receipt) -> Verdict:
    hits = []
    for change in receipt.changed_files:
        if change.language == "python" and change.line_count > 800:
            hits.append("file_too_long")
        for fn in change.functions:
            if fn.line_count > 70:
                hits.append("function_too_long")
        if gitleaks_scan(change.path):
            hits.append("secret_found")
        if vulture_scan(change.path):
            hits.append("dead_code")
        if RISK_REGEX.search(change.diff):
            hits.append("risk_pattern")

    score = min(100, sum(RISK_WEIGHTS[h] for h in hits))
    verdict = "HOLD" if score >= HOLD_THRESHOLD else "GO"
    return Verdict(verdict=verdict, risk_score=score, hits=hits)
    # verdict is appended to the receipt ledger: auditable, reproducible.

Where frameworks stop and governance has to start

I am not building this because the orchestration frameworks are bad. LangGraph gives you strong graph-based orchestration. CrewAI gives you role-based agent teams. AutoGen gives you flexible multi-agent conversations. They coordinate agents well.

None of them treat verification as a first-class layer. You bolt it on yourself, or you stack another model on the problem and hope that more probability cancels out the uncertainty. It does not. Stacking probabilistic reviewers compounds the doubt, it does not remove it.

In my own runtime, VNX Orchestration, the gates are not a plugin. Receipts, gates, and verdicts are the layer the rest of the system is built around. Across the dispatches I have tracked, the gate layer is what turns "the agent thinks it is done" into "the deliverable is verified." The code is open source at github.com/Vinix24/vnx-orchestration.

Verification over trust as a design principle

Trust does not scale. Verification does. That is the whole argument in five words.

Deterministic gates put a reproducible floor under everything an agent produces. Hardcoded checks, no model in the loop, run async on receipts so they cost the agent nothing, returning a GO or HOLD verdict and a risk score a human can act on. LLM review stays useful for taste and architecture. The hard guarantees come from code that does not negotiate. This is the same principle behind my broader approach to glass box governance for multi-agent AI: if you cannot inspect it, you cannot trust it at scale.

If you run agents that write code, look at the gate layer in the repo and fork it. The fastest way to stop trusting "done" is to make the system prove it.

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.

LinkedIn Email GitHub

AI Quality Gates: Why I Don't Trust an Agent That Says 'Done'

The problem: an LLM judge shares the failure mode it is judging

AI quality gates: governance as code, not as policy

Why it runs async: verification after the receipt, zero agent latency

Two-phase verification: light at receipt-time, heavy pre-merge

Verdict and risk score: GO or HOLD as a contract

Where frameworks stop and governance has to start

Verification over trust as a design principle

Vincent van Deth

Gerelateerde artikelen

Een eigen AI onder je bureau, voor minder dan een laptop

Een AI audit trail zonder database: waarom ik chat-logs verving door een ledger

Reacties

Marketing Strategie

Marketing Automatisering

AI Implementatie