Multi-AI Code Review at the Merge Gate: 28 Codex Runs, 14 PRs, What the Data Says

Single-provider AI review has a known bias problem. Claude rarely flags Claude's own bad code. The rationalizations humans bring to their own work ("this looks intentional", "this is fine for the level of polish we're at", "we'll fix this in the next pass") are ones Claude makes too, just faster and with better grammar.

The fix that finally worked in VNX: every PR that touches governance or runtime paths must pass both a Codex gate (OpenAI's codex --dangerously-bypass-approvals-and-sandbox) and a Gemini review (Google Vertex gemini-2.5-pro). Plus deterministic file gates and CI green. Triple-pass required before merge. File-locked, so the orchestrator cannot reason its way around it.

This post covers what 28 codex runs across 14 PRs actually showed. Not "AI review is great" or "AI review is broken", just the data, with the warts.

The setup

The pipeline lives in scripts/codex_final_gate.py (584 LOC) and scripts/lib/vertex_ai_runner.py (209 LOC, raw Vertex REST, no SDK). Each gate produces a structured JSON verdict with status: pass/failed and blocking_findings: [...]. The closure verifier (scripts/closure_verifier.py) treats codex_gate status='failed' as a hard blocker (PR #300, after a real incident where a failed-but-completed gate let closure verification pass).
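A minimal sketch of the hard-blocker rule, assuming a verdict shaped like the one described above. The function name closure_blocked and the fail-closed handling of missing fields are illustrative assumptions, not taken from scripts/closure_verifier.py.

```python
import json


def closure_blocked(verdict_json: str) -> bool:
    """Return True when a gate verdict must block closure.

    Assumption: the verdict is a JSON object with a `status` field
    ("pass" or "failed") and a `blocking_findings` list.
    """
    verdict = json.loads(verdict_json)
    status = verdict.get("status")
    findings = verdict.get("blocking_findings", [])
    # Fail closed: only an explicit pass with no blocking findings is
    # allowed through. A gate that ran to completion but reported
    # status "failed" blocks closure just like a gate that never ran.
    return status != "pass" or len(findings) > 0
```

The point of the sketch is that "the gate finished" and "the gate passed" are separate facts. The verifier should read only the second.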

Three gates, three different jobs:

  1. Codex gate: find concrete bugs. Data integrity, idempotency, scope leaks, race conditions. The codex-finds-bugs persona, run with the --dangerously-bypass-approvals-and-sandbox flag so it can actually run scripts and reproduce.
  2. Gemini gate: review architecture. Bigger picture: does this PR introduce a coupling we will regret, does the abstraction make sense, are there hidden invariants.
  3. Pre-merge gate: deterministic. File size limits, test coverage thresholds, dispatch-id presence in commit messages, no import anthropic (PR #258 / OI-AT-2). These run first, so cheap checks gate the LLM checks.
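The ordering in the list above can be sketched as follows. This is a hedged illustration, not the VNX implementation: imports_module and run_gates are hypothetical names, and the LLM gates are called one after another here, whereas the real pipeline runs them in parallel.

```python
import ast
from typing import Callable, List, Tuple

Gate = Tuple[str, Callable[[], bool]]


def imports_module(source: str, module: str) -> bool:
    """Deterministic check: does `source` import `module` or a submodule?"""
    prefix = module + "."
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name == module or alias.name.startswith(prefix):
                    return True
        elif isinstance(node, ast.ImportFrom) and node.level == 0:
            # level == 0 means an absolute import, so relative imports
            # such as `from .anthropic import x` are not flagged.
            name = node.module or ""
            if name == module or name.startswith(prefix):
                return True
    return False


def run_gates(deterministic: List[Gate], llm: List[Gate]) -> Tuple[bool, List[str]]:
    """Run cheap deterministic gates first; skip LLM gates if any fail."""
    executed: List[str] = []
    for name, check in deterministic:
        executed.append(name)
        if not check():
            return False, executed  # no LLM call is spent on this PR
    results = []
    for name, check in llm:
        executed.append(name)
        results.append(check())  # collect every LLM verdict
    return all(results), executed
```

Parsing the source with ast instead of grepping keeps the check from tripping on a string or comment that merely mentions the forbidden import.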

The pipeline is documented in claudedocs/2026-04-29-codex-findings-synthesis.md, 351 lines of codex-behavior analysis with reproducible counts.

The data: 28 codex runs, 14 PRs

The chain of 14 PRs, #299 through #312, plus #314, was a self-contained sample. Two codex runs each (round 1 and round 2). The synthesis doc breaks it down:

  • Round 1: 14 PRs reviewed. Codex flagged blockers on most.
  • Round 1 → fix → Round 2: developer addresses round-1 findings, codex re-reviews.
  • Round 2 results: 4 PRs ended round-2 zero-blocking. Nine still carried blocking-severity findings.

That last number stings if you read it carelessly. "Nine of fourteen still dirty after a fix-loop round" sounds like the gate is broken.

But here is what the data actually shows when you read the round-2 findings carefully: they cite NEW lines added by round-1 fix commits. Codex is not finding the same thing twice. It is finding new things that the round-1 fixes introduced.

That is genuine deepening, not non-determinism. The fix isn't "stop reviewing." The fix is the severity contract.
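One way to check the "new lines" claim mechanically is sketched below. It assumes each round-2 finding cites a file and a line number, and that the set of lines added by the round-1 fix commits has already been extracted (for example from git diff -U0 between the two reviewed commits; that parsing is not shown). The names Finding and split_findings are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Finding = Tuple[str, int]  # (file path, line number the finding cites)


def split_findings(
    findings: List[Finding], added: Dict[str, Set[int]]
) -> Tuple[List[Finding], List[Finding]]:
    """Split round-2 findings by whether they cite a line added by a fix.

    `added` maps each file path to the line numbers introduced by the
    round-1 fix commits.
    """
    on_new: List[Finding] = []
    on_old: List[Finding] = []
    for path, line in findings:
        if line in added.get(path, set()):
            on_new.append((path, line))
        else:
            on_old.append((path, line))
    return on_new, on_old
```

If most round-2 findings land in the first list, the reviewer is reacting to the fix. If they land in the second, it is re-reading unchanged code and changing its mind.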

[Figure: 28 codex runs, round 1 vs round 2 findings. Round 1: most PRs blocked. Round 2: 4 clean, 9 still flagged, but on NEW lines from round-1 fixes. Iterative deepening, not noise.]

The severity contract that fixed everything

The 28-run analysis showed codex defaulting to error severity for cosmetic findings. Plain-text-vs-JSON stderr formatting flagged as error. Variable name clarity flagged as error. That inflated the blocking-finding rate by something like 75%.

The fix was a single prompt update, PR #323/#324:

"Default severity is warning; promote to error only when the impact includes one of: data loss, false-positive PR closure, false-negative PR rejection, security boundary breach, or cross-dispatch state corruption."

That contract went into both scripts/codex_final_gate.py and scripts/lib/vertex_ai_runner.py (so Gemini reviews carried the same severity rules).
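The contract itself lives in the prompt. As a hedged illustration, the same rule could also be expressed as a post-hoc check on parsed findings. The impacts field and the tag names below are assumptions made for this sketch, not part of the VNX verdict schema.

```python
# Impact tags mirroring the five criteria in the severity contract.
# The tag spellings are hypothetical.
ERROR_IMPACTS = frozenset({
    "data_loss",
    "false_positive_pr_closure",
    "false_negative_pr_rejection",
    "security_boundary_breach",
    "cross_dispatch_state_corruption",
})


def effective_severity(finding: dict) -> str:
    """Apply the contract: default to warning, promote only on impact.

    Assumption: each finding carries an `impacts` list of tags. The
    severity the model claimed is ignored on purpose, so a cosmetic
    finding labelled "error" is demoted to "warning".
    """
    impacts = set(finding.get("impacts", []))
    return "error" if impacts & ERROR_IMPACTS else "warning"
```

A check like this would not replace the prompt rule. It would only make a violation of the contract visible instead of silently blocking a PR.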

After the prompt tightening: round-2 false-blocker rate dropped sharply. Real bugs still got error. Style findings stayed warning. The gate stopped rejecting clean PRs.

This is the most underrated lesson from the experiment. Iterative LLM code review converges if you specify the severity contract. Without one, you get an infinite fix-loop chasing cosmetic findings. With one, you get useful review.

Examples: real bugs that got caught

Five examples from the chain that I would not have caught alone, with PR references for verification.

1. Cross-dispatch data integrity (PR #303)

Codex round 2: "Replacing git add -A and git stash save in subprocess_dispatch.py with manifest-scoped operations is necessary because in a shared worktree (T0+T1+T2 on the same disk), a successful subprocess dispatch had a real path to committing T0's uncommitted edits via git add -A."

This is not theoretical. It would have shipped. Shared worktrees in multi-agent systems produce a class of bug where Codex shines, plausibly because this failure mode shows up in similar codebases.

The fix: every dispatch writes a dispatch_paths.json at start listing the paths the worker is allowed to mutate; commit/stash helpers operate only on that scope. Replaced time-window commit attribution with HEAD-comparison against pre-dispatch SHA.
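A minimal sketch of manifest-scoped staging, assuming dispatch_paths.json holds an object with a paths list of repo-relative files or directories. That layout and the helper names are assumptions; the real helpers in subprocess_dispatch.py may differ.

```python
import json
from pathlib import PurePosixPath
from typing import Iterable, List


def load_allowed_paths(manifest_text: str) -> List[str]:
    """Assumed manifest layout: {"paths": ["dir/or/file", ...]}."""
    return list(json.loads(manifest_text)["paths"])


def in_scope(path: str, allowed: Iterable[str]) -> bool:
    """True if `path` is an allowed entry or lives under an allowed directory."""
    candidate = PurePosixPath(path)
    for entry in allowed:
        root = PurePosixPath(entry)
        if candidate == root or root in candidate.parents:
            return True
    return False


def scoped_add_command(changed: Iterable[str], allowed: Iterable[str]) -> List[str]:
    """Build a `git add` argv limited to the manifest scope.

    Returns an empty list when nothing is in scope, so the caller
    never falls back to `git add -A`.
    """
    allowed = list(allowed)
    scoped = sorted(p for p in changed if in_scope(p, allowed))
    if not scoped:
        return []
    # `--` ends option parsing, so every later argument is a path.
    return ["git", "add", "--"] + scoped
```

Comparing path components rather than string prefixes matters here: scripts/library/x.py must not count as being inside scripts/lib.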

2. Schema drift between writer and reader (PR #322 / CFX-3)

Pre-PR-322: writers wrote status: "completed". Readers checked verdict == "pass". The result was a false-negative loop: a GateRunner success went unrecognized by closure verification. Two ends of the contract spoke different keys. The same class of bug that the log-shaped state architecture is designed to surface early.

Codex flagged this as the root of an entire bug class. The fix: scripts/lib/gate_status.py::is_pass() as single source of truth. Migrate legacy verdict field with a DeprecationWarning.
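A single-source-of-truth helper of this kind might look like the sketch below. The canonical field and value (status equal to "pass") and the fallback order are assumptions for illustration; the actual scripts/lib/gate_status.py may define them differently.

```python
import warnings


def is_pass(record: dict) -> bool:
    """Decide whether a gate record counts as a pass.

    Assumptions: the canonical field is `status` with the value "pass",
    and `verdict` is the legacy field still found in older records.
    """
    if "status" in record:
        return record["status"] == "pass"
    if "verdict" in record:
        # Legacy records still work, but every read nags the caller
        # to migrate the writer.
        warnings.warn(
            "gate record uses legacy `verdict`; write `status` instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return record["verdict"] == "pass"
    return False  # no recognizable field: fail closed
```

With every reader routed through one function, a writer that drifts to a new key fails in one visible place instead of producing a quiet false negative downstream.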

3. CLI flag forwarding bug (PR #320)

Two PRs back-to-back shipped the same bug: argparse parsed a flag, but the function never received it. Codex pointed out that the missing test was an end-to-end forwarding test, not unit tests of argparse alone.

The fix: tests/test_cli_flag_forwarding.py covers --branch, --mode, --require-github-pr, --pr-id end-to-end.
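The shape of such an end-to-end forwarding test can be sketched as follows. The flag names come from the post; the defaults, types, and the run_gate and main functions are hypothetical stand-ins for the real entry point.

```python
import argparse
from typing import Callable, List, Optional


def run_gate(branch: str, mode: str, require_github_pr: bool,
             pr_id: Optional[int]) -> dict:
    """Stand-in for the real gate function; echoes what it received."""
    return {"branch": branch, "mode": mode,
            "require_github_pr": require_github_pr, "pr_id": pr_id}


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    parser.add_argument("--branch", default="main")
    parser.add_argument("--mode", default="final")
    parser.add_argument("--require-github-pr", action="store_true")
    parser.add_argument("--pr-id", type=int, default=None)
    return parser


def main(argv: List[str], gate: Callable[..., dict] = run_gate) -> dict:
    args = build_parser().parse_args(argv)
    # The bug class: a flag parsed above but missing from this call.
    return gate(branch=args.branch, mode=args.mode,
                require_github_pr=args.require_github_pr, pr_id=args.pr_id)


def test_flags_reach_gate() -> None:
    """Drive main() with real argv and assert on what the gate received."""
    received = {}

    def spy(**kwargs):
        received.update(kwargs)
        return kwargs

    main(["--branch", "feat/x", "--mode", "strict",
          "--require-github-pr", "--pr-id", "320"], gate=spy)
    assert received == {"branch": "feat/x", "mode": "strict",
                        "require_github_pr": True, "pr_id": 320}
```

A unit test of build_parser() alone would pass even if main() dropped pr_id from the call, which is exactly the gap Codex pointed at.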

4. Idempotency in nightly cron (PR #299 → #313)

PR #299 added compact_state.py for nightly NDJSON archive rotation. Codex round 1 caught the archive-write-then-rewrite ordering problem: crash mid-cycle would leave the live file untrimmed forever.

PR #313 was a follow-up specifically to fix the round-1 finding. Atomic temp-file + rename ordering. The follow-up PR exists because codex flagged it. That is the value of the gate making it hard to merge with known issues.
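The temp-file-plus-rename technique can be sketched as below. This is a generic illustration of the pattern, not the code in compact_state.py; atomic_rewrite is a hypothetical name.

```python
import os
import tempfile
from typing import Iterable


def atomic_rewrite(path: str, lines: Iterable[str]) -> None:
    """Replace the contents of `path` so readers see old or new, never half.

    The new content goes to a temp file in the same directory, which
    keeps the final rename on one filesystem. A crash before the rename
    leaves the original file untouched.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".compact-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.writelines(lines)
            handle.flush()
            os.fsync(handle.fileno())  # push data to disk before the rename
        os.replace(tmp_path, path)  # replaces the destination in one step
    except BaseException:
        # Clean up the temp file on any failure, then re-raise.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

For a rotation job, the safe order is to write the archive first, then call a function like this to trim the live file. An interrupted run can then be repeated without losing records.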

5. Console-error filter eating real errors (PR #305)

Round 2 codex finding: "the new console filter suppresses real React validateDOMNesting errors." The test suite was filtering out the very errors it was supposed to catch.

This is a meta-finding: the test suite's own filter was hiding the bugs it was supposed to catch. A pure deterministic gate would never have caught this. Codex did.

What it costs

Honest section. The pipeline is not free.

Time. Codex review on a moderate PR takes 2-5 minutes. Gemini review takes 1-3 minutes. They run in parallel, so overall added time per PR is roughly 5 minutes, but on a stuck PR with multiple round-2s, you are adding 15-30 minutes total before merge.

Tokens. Codex CLI subscription handles auth and quota; Gemini Vertex is metered per request. Across the 28-run chain the total Vertex cost was under $10. Not a budget item. Worth flagging if you scale to hundreds of PRs per week.

Operator time when gates fail. Most failures are real. Some are not. The severity contract update (PR #323/#324) cut the false-positive rate sharply, but you still have to triage occasionally. Budget 10-15% of PRs needing a quick reviewer call: "Is this finding actionable or noise?"

What I will not claim

A few things, because LIMITATIONS matters more than feature lists.

Not "AI review replaces human review." The orchestrator runs the gates. The operator still reviews the merge. The gates are part of the review, not the entirety.

Not "all 14 PRs ended clean." Nine still had blocking findings after round 2. The honest framing: codex finds real bugs on most PRs. The system surfaces them. Whether you fix them or accept them is an operator decision.

Not "this is the only way." Single-provider review with a strong human reviewer can match this on small teams. Multi-provider mutual review wins as you scale and as PRs touch more critical paths. It is not a magic fix for code quality; it is a forcing function for explicit decisions.

Not "Codex 5.2 is identical to 5.4." Provider output schemas drift between versions. PR #307 surfaced exactly that drift. The normalization layer (scripts/lib/runtime_facade.py:get_adapter) handles this with version fixtures, but every model bump is a small adapter PR.

What this changes for AI-assisted code review

Three things, six months in.

One: "AI review" without a severity contract converges on noise. Specify the contract. Default warning. Promote to error only on impact criteria.

Two: Multi-provider mutual review catches what one provider would refuse to ship. The bias is real. Two-provider sanity checks beat one-provider perfectionism.

Three: Deterministic gates run first. LLM gates run last. The orchestrator cannot talk its way past file size limits, missing tests, or import anthropic. That sequencing is the single most underrated piece of the pipeline, and it leans on the receipt ledger as canonical truth.

The data is on disk. The synthesis doc is open source. Every claim in this post is reproducible from your terminal in thirty seconds.

Read also: The Unified Supervisor Pack: from manual kill -9 to self-healing, what happens when one of these gates fails and the system has to recover on its own.


Sources & references

  1. VNX Orchestration repo
  2. claudedocs/2026-04-29-codex-findings-synthesis.md, 351 lines of codex-behavior analysis
  3. VNX LIMITATIONS / Anti-claims
  4. PRs referenced: #258 (dispatch-id in commits), #299 + #313 (compact_state + follow-up), #300 (closure verifier hardened), #303 (cross-dispatch data integrity), #305 (console filter finding), #307 (provider schema drift), #320 (CLI flag forwarding test), #322 (gate schema canonicalization), #323 + #324 (severity prompt tightening)

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.
