Five Model Families, One Agentic Harness

If you benchmark frontier models with a single chat completion, you are measuring the wrong thing.

A one-shot call gives the model no tools. It cannot read a file, run a test, or fix its own mistake. So the score you get back is not "how good is this model at coding," it is "how good is this model at writing code blind, in one pass." For agentic coding work that gap is the whole story.

I wanted a fair comparison across Claude (Opus, Sonnet), OpenAI Codex (gpt-5.5), Z.AI GLM-5.x, Moonshot Kimi K2.7 and DeepSeek V4. Same tasks, same scoring, same agentic loop. The hard part was never the scoring. It was getting five different model families to run inside one harness without each one failing for a different infrastructure reason.

This is a peer write-up. If you are building something similar through Claude, the proxy pattern, the hardening notes and the numbers below will save you a few nights.

The fairness problem, stated plainly

Claude and DeepSeek were easy: both speak the Anthropic Messages protocol, so both drive the claude CLI directly. The CLI gives them the full agentic harness for free. Tools, file I/O, sub-agents, iteration.

The other three were the problem.

GLM and Kimi went through a simple chat runner at first. No tools. Their correctness scored zero across every cell, not because the models are weak, but because the harness could not let them produce a deliverable. The harness was the bottleneck, not the model. That is a measurement bug, and it is the kind of bug that quietly poisons a benchmark if you do not catch it.

So I had two options. Build a second agentic harness for the OpenAI protocol, or bridge the OpenAI-shaped models into the Claude harness I already trusted. I did both, and the second one is the interesting one.

The proxy trick: drive any model through the Claude CLI

The claude CLI talks to an Anthropic /v1/messages endpoint. It reads ANTHROPIC_BASE_URL to decide where. That is the whole hook.

Put a local litellm proxy on localhost:4141 that exposes an Anthropic-compatible /v1/messages, backed by GLM on OpenRouter:

```yaml
# litellm proxy config: Anthropic-shaped front, OpenRouter GLM back
model_list:
  - model_name: glm-5.2
    litellm_params:
      model: openrouter/z-ai/glm-5.2
      api_key: os.environ/OPENROUTER_API_KEY
litellm_settings:
  drop_params: true        # tolerate Anthropic params OpenRouter does not map 1:1
general_settings:
  master_key: sk-local-only # local auth between the CLI and this proxy
```

Then point the CLI at it:

```bash
ANTHROPIC_BASE_URL=http://localhost:4141 \
ANTHROPIC_AUTH_TOKEN=sk-local-only \
claude -p "your task" --output-format stream-json
```

The CLI believes it is talking to Anthropic. Inference flows to GLM-5.2 through OpenRouter. GLM now gets the exact same harness as Claude, the same tools, the same iteration budget. "Same model, full harness versus flat tool-call" becomes a real, measurable comparison instead of a guess.

Two constraints this respects, and they matter if you run a governed setup:

No Anthropic SDK anywhere. It is the CLI, a subprocess, not import anthropic. That keeps the account-safety posture intact.
No direct vendor account for GLM. Inference still flows through OpenRouter, so the cost trail and the provider policy stay consistent.
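
For a harness that spawns the CLI as a subprocess rather than from a shell, the same two environment variables can be set per dispatch. The following is a minimal sketch, not the benchmark's actual runner: the function names are hypothetical, the command is the one shown above, and the only thing assumed about the stream-json output is that it is newline-delimited JSON whose events may carry a `type` field.

```python
import json
import os
import subprocess
from collections import Counter
from typing import Iterable


def proxy_env(base_url: str, token: str) -> dict:
    """Copy the current environment and point the CLI at the local proxy."""
    env = dict(os.environ)
    env["ANTHROPIC_BASE_URL"] = base_url
    env["ANTHROPIC_AUTH_TOKEN"] = token
    return env


def parse_events(lines: Iterable[str]) -> list:
    """Parse newline-delimited JSON, skipping blank and non-JSON lines."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            events.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # tolerate stray non-JSON output instead of crashing the cell
    return events


def event_type_counts(events: list) -> Counter:
    """Count events by their 'type' field (assumed, hence the 'unknown' fallback)."""
    return Counter(e.get("type", "unknown") for e in events if isinstance(e, dict))


def run_task(task: str, base_url: str = "http://localhost:4141",
             token: str = "sk-local-only", timeout_s: float = 1800.0):
    """Run one task through the claude CLI via the proxy; return (rc, events)."""
    cmd = ["claude", "-p", task, "--output-format", "stream-json"]
    # capture_output uses communicate() internally, so both pipes are drained.
    proc = subprocess.run(cmd, env=proxy_env(base_url, token),
                          capture_output=True, text=True, timeout=timeout_s)
    return proc.returncode, parse_events(proc.stdout.splitlines())
```

Keeping the environment construction and the event parsing as pure functions means they can be unit-tested without the CLI or the proxy running.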

Codex is the same idea with a different transport. codex exec --json is its own agentic loop, so I drive it directly and parse its JSON event stream rather than proxying it. The principle holds: give every model a real agentic loop, then score what it builds, not what it describes.

If you want only the GLM-5.2 recipe and the cheapest route into a harness, I split that into a focused how-to: GLM-5.2 through the Claude harness via OpenRouter.

Then the lanes started failing, one at a time

A benchmark is a brutal stress test of the layer underneath it. Not the models. The execution layer: process spawning, isolation, streams, worktrees, credits. Running five model families on dozens of tasks under concurrency surfaced bugs that never show up in a single happy-path dispatch. Here are the ones worth your time.

The Codex stderr pipe deadlock

This one was a real production bug, not a test artifact.

codex exec --json was spawned with stderr=PIPE, but the stream drainer only ever read stdout. Codex runs at high reasoning effort, which emits a lot of stderr. The OS pipe buffer fills at around 64 KB, the write blocks, and Codex deadlocks. The harness saw rc=1 at a few seconds with no deliverable and recorded a failure. It looked like launch flakiness. It was a classic undrained-pipe deadlock.

The fix is one daemon thread that drains stderr to a log:

```python
# spawn codex with BOTH streams drained, not just stdout
threading.Thread(target=_drain, args=(proc.stderr, err_buf), daemon=True).start()
```

Any Codex worker that ever emits more than 64 KB of stderr would hit this. The benchmark just made it happen on every heavy cell, so it was impossible to ignore.
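
To make the failure mode and the fix concrete, here is a self-contained sketch of a spawn helper that drains both pipes on background threads. It is an illustration under assumptions, not the harness's own code: `run_drained` and this `_drain` are hypothetical names, and the child process in the demo is a stand-in that writes about 200 KB to stderr, well past a typical 64 KB pipe buffer.

```python
import subprocess
import sys
import threading


def _drain(stream, sink: list) -> None:
    """Read a pipe to EOF so the child never blocks on a full pipe buffer."""
    for line in stream:
        sink.append(line)
    stream.close()


def run_drained(cmd, timeout_s=None):
    """Run cmd with stdout AND stderr drained concurrently; return (rc, out, err)."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    out_buf, err_buf = [], []
    threads = [
        threading.Thread(target=_drain, args=(proc.stdout, out_buf), daemon=True),
        threading.Thread(target=_drain, args=(proc.stderr, err_buf), daemon=True),
    ]
    for t in threads:
        t.start()
    try:
        rc = proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()          # killing the child closes its pipe ends
        rc = proc.wait()
    for t in threads:
        t.join()             # both drainers finish once the pipes hit EOF
    return rc, "".join(out_buf), "".join(err_buf)


if __name__ == "__main__":
    # Stand-in child: floods stderr, then prints one line to stdout.
    child = [sys.executable, "-c",
             "import sys; sys.stderr.write('x' * 200_000); print('done')"]
    rc, out, err = run_drained(child)
    print(rc, out.strip(), len(err))
```

If only stdout were read, the same child would block on its stderr write once the pipe buffer filled, and `proc.wait()` would never return without a timeout.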

Isolation leaks into the committed seed

Unsandboxed CLI workers run with the repository as their working directory. On from-scratch tasks they would navigate up and write into the main checkout's committed task seed. One worker's output contaminated the next cell's input.

The fix was to move every worker's git worktree outside the repository root, so repo-relative navigation cannot reach the main checkout. Plus a fail-loud guard that refuses to run a cell in the shared checkout at all, and a file lock to serialize worktree add and remove so parallel git operations stop tripping over each other.

Lesson: "isolated" is not a property you assume, it is one you fail loudly on when it is violated.
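
A fail-loud guard of this kind can be quite small. The sketch below is one possible shape, with hypothetical names throughout (`IsolationError`, `assert_outside_repo`, `worktree_lock`); the lock uses POSIX `fcntl.flock`, so it assumes Linux or macOS.

```python
import fcntl
import os
from contextlib import contextmanager
from pathlib import Path


class IsolationError(RuntimeError):
    """Raised when a cell would run inside the shared checkout."""


def assert_outside_repo(worktree: Path, repo_root: Path) -> Path:
    """Refuse any worktree path that is the repo root or lives beneath it."""
    wt = Path(worktree).resolve()      # resolve() collapses '..' and symlinks
    root = Path(repo_root).resolve()
    if wt == root or root in wt.parents:
        raise IsolationError(f"refusing to run: {wt} is inside {root}")
    return wt


@contextmanager
def worktree_lock(lock_path: Path):
    """Serialize `git worktree add` / `remove` across processes with an flock."""
    fd = os.open(str(lock_path), os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)   # blocks until no other holder remains
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

The guard runs before every cell, so a misconfigured worktree path stops the dispatch instead of silently writing into the seed.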

Scoring a deliverable when the process exits non-zero

The GLM-through-harness lane produced correct work but the CLI exited rc=1. The receipt validator did not recognize the new provider, so the governance step raised, and the whole dispatch crashed with a non-zero code before a receipt was ever written. For weeks I read that as "the model does not close the loop cleanly." It was a one-line allow-list that did not list the lane.

That fix unlocked a subtler one. Once you score deliverables on a non-zero exit, an immediate-exit cell that produced nothing would get scored against the untouched seed and earn a bogus baseline. So scoring-on-failure has to be gated on real wall-clock: if the cell ran for two seconds and made no model call, it is a DNF, not a near-miss. Three small fixes, one clean scoreboard.
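
The gate described above could be sketched as follows. The field names and the five-second floor are assumptions chosen for illustration, and the rule is a conservative reading: either a missing model call or an exit before the floor marks the cell as a DNF, while the exit code alone never blocks scoring.

```python
from dataclasses import dataclass

MIN_WALL_CLOCK_S = 5.0  # hypothetical floor; tune per lane


@dataclass
class CellRun:
    exit_code: int
    wall_clock_s: float
    model_calls: int


def scoring_decision(run: CellRun, floor_s: float = MIN_WALL_CLOCK_S) -> str:
    """Return 'score' for a genuine run, 'dnf' for an immediate-exit cell."""
    if run.model_calls == 0 or run.wall_clock_s < floor_s:
        return "dnf"    # nothing real happened, so never score the untouched seed
    return "score"      # a non-zero exit code alone does not disqualify the work


print(scoring_decision(CellRun(exit_code=1, wall_clock_s=2.0, model_calls=0)))
print(scoring_decision(CellRun(exit_code=1, wall_clock_s=240.0, model_calls=12)))
```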

Quota is the silent killer

Every provider has a different wall, and each one fails in a way that mimics a code bug:

Codex yields a few cells per window, then immediate-exits at 5 to 16 seconds. Account rate limit, not a crash.
Kimi gives roughly nine cells, then hard-stops with a 429.
GLM on OpenRouter returns a 402 the moment the request reserves more max_tokens than the key's remaining daily limit can afford. The intermittency is not random: the daily budget drains through the day, so cells that pass at 09:00 fail at 13:00.

The harness has to tell these apart from genuine incapacity. A model that fails for quota gets re-run in a fresh window. A model that runs the full task and still produces nothing is real signal you keep. Conflate the two and your benchmark lies in both directions.
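
One way to encode that split is a small bucketing function. This is a sketch under assumptions: it presumes the harness can extract an HTTP status from the provider error, and the 20-second early-exit floor is a placeholder picked to sit above the 5 to 16 second Codex exits mentioned above.

```python
from enum import Enum
from typing import Optional


class Bucket(Enum):
    RERUN = "rerun_in_fresh_window"
    REAL_ZERO = "record_as_real_zero"


QUOTA_STATUSES = {402, 429}  # out of credit, rate-limited


def bucket_failure(http_status: Optional[int], wall_clock_s: float,
                   early_exit_floor_s: float = 20.0) -> Bucket:
    """Decide whether a failed cell is infrastructure noise or real signal."""
    if http_status in QUOTA_STATUSES:
        return Bucket.RERUN          # provider wall, says nothing about the model
    if wall_clock_s < early_exit_floor_s:
        return Bucket.RERUN          # exited before it could attempt the task
    return Bucket.REAL_ZERO          # ran the task and still produced nothing
```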

The best debugging moment of the whole effort came from this. I dispatched GLM, Kimi and DeepSeek as agents to inspect the failing lanes and propose fixes. The GLM agent could not finish, and its own failure report contained the exact 402 with the message "you requested 32000 tokens, can only afford 4026, adjust the key's daily limit." The model debugged its own lane by failing in a legible way.
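
A legible failure like that one can also be turned into structured fields for the dispatch report. The sketch below parses the message as quoted above; the regular expression is an assumption about its wording, and a provider may phrase the error differently, in which case the function simply returns `None`.

```python
import re
from typing import Optional, Tuple

# Matches "... requested <N> tokens ... afford <M> ..." as in the quoted 402.
_AFFORD_RE = re.compile(r"requested\s+(\d+)\s+tokens.*?afford\s+(\d+)",
                        re.IGNORECASE | re.DOTALL)


def parse_affordable(message: str) -> Optional[Tuple[int, int]]:
    """Return (requested_tokens, affordable_tokens), or None if the text differs."""
    m = _AFFORD_RE.search(message)
    return (int(m.group(1)), int(m.group(2))) if m else None


def quota_shortfall(message: str) -> Optional[int]:
    """Tokens the request exceeded the remaining budget by, if parseable."""
    parsed = parse_affordable(message)
    if parsed is None:
        return None
    requested, affordable = parsed
    return requested - affordable


msg = ("you requested 32000 tokens, can only afford 4026, "
       "adjust the key's daily limit.")
print(parse_affordable(msg), quota_shortfall(msg))
```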

What the numbers say

The plumbing is the point of this post, but the numbers are why the plumbing matters. The composite below is 0 to 5, weighted: 40% correctness and 20% completeness from a deterministic verify.py (pytest, SQL constraints, adversarial matrices), 15% cost and 15% speed, and 10% code-quality from a cross-provider judge panel. 442 canonical cells, 14 lanes, 15 tasks across six tiers. Every cell is the median composite of its replications; N is the cell count, and a low N means read the direction, not the decimal.
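
As a worked illustration of that weighting, the sketch below assumes each component has already been normalized to the same 0 to 5 scale before weighting; the source does not spell out the normalization, and the real logic lives in scorer.py. The key names are hypothetical.

```python
from statistics import median

# Weights as stated above; they sum to 1.0.
WEIGHTS = {
    "correctness": 0.40,
    "completeness": 0.20,
    "cost": 0.15,
    "speed": 0.15,
    "code_quality": 0.10,
}


def composite(components: dict) -> float:
    """Weighted 0-5 composite for one replication."""
    missing = set(WEIGHTS) - set(components)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * components[name] for name in WEIGHTS)


def cell_score(replications: list) -> float:
    """A cell is the median composite over its replications."""
    return median(composite(rep) for rep in replications)
```

With these weights, a replication that is perfect on everything except correctness tops out at 3.0, which is why a harness that blocks the deliverable drags the whole composite down.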

The quality matrix (median composite per lane × tier)

Lane	T1 trivial	T2 medium	T3 complex	T4 frontier	T5 review/design	T6 real review
Claude Opus 4.6	4.88 (N=9)	4.99 (N=9)	4.47 (N=6)	4.19 (N=4)	4.22 (N=4)	4.38 (N=4)
Claude Opus 4.7	4.98 (N=9)	4.97 (N=9)	4.47 (N=6)	4.16 (N=4)	4.39 (N=4)	4.38 (N=4)
Claude Opus 4.8	4.97 (N=9)	4.96 (N=9)	4.70 (N=6)	4.12 (N=4)	4.34 (N=4)	4.37 (N=4)
Claude Sonnet 4.6	4.97 (N=9)	4.97 (N=9)	4.72 (N=6)	4.17 (N=4)	4.21 (N=4)	4.39 (N=4)
Codex GPT-5.4	4.45 (N=9)	0.00 (N=9)	0.00 (N=6)	0.00 (N=4)	0.00 (N=4)	—
Codex GPT-5.5	4.45 (N=9)	4.43 (N=9)	4.45 (N=6)	4.12 (N=4)	0.00 (N=4)	4.40 (N=4)
DeepSeek V4 Flash (harness)	4.71 (N=9)	4.96 (N=9)	4.36 (N=6)	4.60 (N=4)	4.25 (N=4)	4.45 (N=4)
DeepSeek V4 Pro (harness)	4.98 (N=9)	4.97 (N=9)	4.89 (N=6)	4.69 (N=4)	4.29 (N=4)	4.44 (N=4)
GLM-5	4.88 (N=9)	4.86 (N=9)	4.87 (N=6)	4.36 (N=4)	4.17 (N=4)	4.34 (N=4)
GLM-5.1	4.86 (N=9)	4.93 (N=9)	4.72 (N=6)	0.72 (N=4)	4.07 (N=4)	4.43 (N=4)
GLM-5.2 (flat)	4.94 (N=9)	4.93 (N=9)	4.25 (N=6)	0.69 (N=4)	4.09 (N=4)	1.77 (N=4)
GLM-5.2 (harness)	4.49 (N=1)	—	4.46 (N=4)	4.15 (N=2)	4.38 (N=2)	4.37 (N=4)
Kimi K2.6	4.75 (N=1)	—	—	—	—	—
Kimi K2.7	4.98 (N=9)	4.96 (N=9)	4.46 (N=6)	4.03 (N=4)	4.36 (N=4)	4.42 (N=4)

The trivial and medium tiers compress: most lanes that run land above 4.8 (the Codex lanes sit near 4.45), so these tiers do not discriminate. The work happens from T3 onward, and on T4-T6 the harnessed open lanes (DeepSeek Pro 4.69/4.29/4.44, DeepSeek Flash 4.60/4.25/4.45) sit shoulder to shoulder with the Claude lanes. The 0.72 and 0.69 cells in the GLM-5.1 and GLM-5.2 (flat) rows on T4 are the flat-runner collapse that finding 1 is about, and GLM-5.2 recovers to 4.15 on the same tier once harnessed.

Where models actually separate (lane × discriminating task)

Lane	State-machine SSE	Path-sandbox (sec)	Mock-introspection	Agent-engine design	Real-review A	Real-review B
Claude Opus 4.6	4.89	3.90	4.47	4.42	4.38	4.38
Claude Opus 4.7	4.99	3.88	4.47	4.39	4.37	4.38
Claude Opus 4.8	4.98	3.80	4.70	4.34	4.38	4.37
Claude Sonnet 4.6	5.00	3.92	4.47	4.40	4.40	4.38
Codex GPT-5.4	0.00	0.00	0.00	0.00	—	—
Codex GPT-5.5	4.49	3.86	4.42	0.00	4.42	2.19
DeepSeek V4 Flash (harness)	4.50	4.21	2.49	4.45	4.45	4.45
DeepSeek V4 Pro (harness)	4.95	4.43	4.71	4.42	4.44	4.44
GLM-5	4.89	3.97	4.95	4.29	4.34	3.18
GLM-5.1	4.74	1.44	4.96	2.02	4.43	3.15
GLM-5.2 (flat)	2.50	2.59	4.14	2.94	3.02	1.74
GLM-5.2 (harness)	4.16	3.92	4.46	4.45	4.36	4.37
Kimi K2.7	4.49	3.55	4.46	4.45	4.41	4.43

Four findings stand out.

1. The harness reveals capability the flat runner hides

This is the cleanest result in the whole matrix. The same model, z-ai/glm-5.2 via OpenRouter, scored two ways: once through a thin agentic runner, once through the full claude CLI harness via the proxy above. Each cell is the median composite over its replications.

Task (discriminating)	GLM-5.2 flat	N	GLM-5.2 harness	N	Δ
State-machine + SSE	2.50	2	4.16	2	+1.66
Path-sandbox (security)	2.59	2	3.92	1	+1.33
Mock-introspection trap	4.14	2	4.46	2	+0.32
Agent-engine system design	2.94	2	4.45	1	+1.52
Real codebase review A	3.02	2	4.36	2	+1.33
Real codebase review B	1.74	2	4.37	2	+2.62

Same weights, same endpoint, same tasks. The only thing that changed is the harness, and a mid-pack runner climbed into the top cluster on every discriminating task. The lift is largest exactly where the work is hardest: +2.62 on a real codebase review, +1.66 on a state-machine build. The flat runner's weakness is also a variance story. On real-review A it was bimodal across reps (one run 4.29, one 1.76, median 3.02) while the harness held steady at 4.36. A benchmark that only ran the flat call would have filed GLM-5.2 under "weak at coding" and been confidently wrong. The harness N is thin (one to two reps per cell), so read these as a direction, not a decimal.
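
Because a median of two bimodal runs hides the split, it helps to report the spread next to it. A minimal sketch, where the function name and the 1.0 flag threshold are assumptions; the example uses the two real-review A flat-runner scores quoted above.

```python
from statistics import median


def summarize_reps(scores: list, spread_flag: float = 1.0) -> dict:
    """Median plus min-max spread for one cell's replications."""
    if not scores:
        raise ValueError("a cell needs at least one finished replication")
    spread = max(scores) - min(scores)
    return {
        "n": len(scores),
        "median": median(scores),
        "spread": spread,
        "unstable": spread > spread_flag,  # read the reps, not the median
    }


# The flat runner on real-review A: one run at 4.29, one at 1.76.
print(summarize_reps([4.29, 1.76]))
```

The two runs give a median of about 3.02 with a spread of about 2.5, so the cell is flagged rather than ranked.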

2. Harnessed open models reach Claude-tier on the hard tasks

The discriminating tasks are where models separate. On agent-engine system design, the top cluster is DeepSeek-flash-harness, GLM-5.2-harness and Kimi K2.7 at 4.45, DeepSeek-pro-harness and Claude Opus 4.6 at 4.42, Claude Sonnet 4.6 at 4.40, and the other Claude Opus lanes between 4.34 and 4.39. On the two real-codebase reviews the same harnessed open lanes plus Kimi K2.7 cluster between 4.36 and 4.45, shoulder to shoulder with Claude.

The frontier gap on review and design work is far smaller than a one-shot leaderboard suggests, as long as the harness is equal. On the trivial tier the picture is different: there most Claude lanes and Kimi K2.7 lead at 4.97-4.98 while a harnessed open lane like DeepSeek-flash sits lower at 4.71. The lift is concentrated exactly where it should be, on the work that needs read, test and fix, not on a one-file mechanical edit. One honest dent in the open-lane story: DeepSeek-flash cratered to 2.49 on the mock-introspection trap (N=2), the one place a Claude or DeepSeek-pro lane clearly held and it did not.

3. Lane reliability is not the same as model capability

The Codex rows carry 0.00 cells, and they are the most misread numbers in the matrix. The split below, counted over every run and not just the canonical ones, is the only honest way to read them.

Lane	Attempts	Completed	Launch-fail	Capability-fail	Launch-rate
Claude Opus 4.6	63	58	4	1	0.94
Claude Opus 4.7	63	58	4	1	0.94
Claude Opus 4.8	69	63	4	2	0.94
Claude Sonnet 4.6	66	58	8	0	0.88
Codex GPT-5.4	54	19	8	27	0.85
Codex GPT-5.5	114	41	31	42	0.73
DeepSeek V4 Flash (harness)	116	55	61	0	0.47
DeepSeek V4 Pro (harness)	109	65	44	0	0.60
GLM-5	86	64	22	0	0.74
GLM-5.1	146	95	47	4	0.68
GLM-5.2 (flat)	96	63	31	2	0.68
GLM-5.2 (harness)	17	14	2	1	0.88
Kimi K2.6	2	1	1	0	0.50
Kimi K2.7	129	79	32	18	0.75

Launch-rate is the fraction of attempts that started cleanly; the completion rate is lower again because some runs launched and then failed. The two Codex lanes split apart here. GPT-5.5 launched 73% of the time and, when it landed, scored with the leaders: 4.42 on real-review A, 4.49 on state-machine. But it completed only 41 of 114 attempts (36%), so a thin matrix easily shows a zero where there simply was no finished run. GPT-5.4 is a harder story: 27 capability-fails and zeros across T2 through T5 mean it genuinely did not complete the harder tiers, not just that the launcher choked. So the Codex zeros are a blend, mostly completion-reliability for 5.5 and real non-completion for 5.4, and neither should be read as "Codex is bad at review" when the cells it did finish landed around 4.4. [^codex]

The same standard runs through the whole benchmark: a cell is a real score only when a genuine run finished, and the launch table is published next to the quality matrix so a zero can never quietly masquerade as a verdict. DeepSeek-flash makes the point from the other side: a 0.47 launch-rate that still produced top-cluster scores on the cells that finished.
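
The two rates can be recomputed from the table's own columns. The formula below is inferred from the published rows (launch-rate equals attempts minus launch-fails, over attempts) and reproduces them to two decimals; the function name is hypothetical.

```python
def lane_rates(attempts: int, launch_fail: int, completed: int) -> dict:
    """Launch and completion rates for one lane, as fractions of attempts."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return {
        "launch_rate": (attempts - launch_fail) / attempts,
        "completion_rate": completed / attempts,
    }


# Codex GPT-5.5 row: 114 attempts, 31 launch-fails, 41 completed.
codex = lane_rates(114, 31, 41)
print(round(codex["launch_rate"], 2), round(codex["completion_rate"], 2))
```

For the Codex GPT-5.5 row this gives 0.73 and 0.36, which is the gap between "launched" and "finished" that the zeros in the quality matrix hide.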

4. Cost and speed are a separate axis from quality

Quality clustering and cost clustering are not the same chart.

Lane	Total $	Median $/cell (metered)	$/composite-point	Note
Claude Opus 4.6	0.0000	—	0.00000	subscription: $0 = not metered
Claude Opus 4.7	0.0000	—	0.00000	subscription: $0 = not metered
Claude Opus 4.8	0.0000	—	0.00000	subscription: $0 = not metered
Claude Sonnet 4.6	0.0000	—	0.00000	subscription: $0 = not metered
Codex GPT-5.4	1.4002	1.4002	0.02977	metered (Codex CLI)
Codex GPT-5.5	1.0529	1.0529	0.00800	metered (Codex CLI)
DeepSeek V4 Flash (harness)	0.0162	0.0081	0.00010	metered (DeepSeek API)
DeepSeek V4 Pro (harness)	0.0379	0.0189	0.00023	metered (DeepSeek API)
GLM-5	2.4305	0.0072	0.01529	metered (OpenRouter)
GLM-5.1	0.9956	0.0038	0.00714	metered (OpenRouter)
GLM-5.2 (flat)	3.3895	0.0027	0.02482	metered (OpenRouter)
GLM-5.2 (harness)	0.0000	—	0.00000	capture-gap: metered, not logged
Kimi K2.6	0.0000	—	0.00000	subscription: $0 = not metered
Kimi K2.7	0.0000	—	0.00000	subscription: $0 = not metered

The subscription lanes (Claude, Kimi-via-CLI) report $0, which means "not metered," not "free." [^cost] GLM-5.2-harness is a genuine capture-gap: metered via OpenRouter, but the harness lane did not log it. Among the lanes you can actually see a bill for, the standout is not a frontier name. DeepSeek-flash earns a composite point for $0.00010 and DeepSeek-pro for $0.00023, roughly two orders of magnitude under the Codex lanes at $0.008 to $0.030 per point, while landing 4.44-4.45 on the real reviews. The median per-cell metered cost runs from under two cents on the DeepSeek and GLM lanes to a dollar or more on Codex.

Speed is the thinnest axis, because token-rate capture was intermittent on the subscription and harness lanes.

Lane	Median tokens/s (N measured)	Median wallclock (s)
Claude Opus 4.8	18.6 (N=2)	269
DeepSeek V4 Pro (harness)	56.5 (N=1)	116
GLM-5	36.2 (N=26)	124
GLM-5.1	29.9 (N=24)	193
GLM-5.2 (flat)	14.5 (N=25)	251
Claude Opus 4.6	— (0 measured)	106
Claude Sonnet 4.6	— (0 measured)	137
Codex GPT-5.4	— (0 measured)	13
Kimi K2.7	— (0 measured)	146

Read this one with the launch table open. The only robust tokens/s numbers are the GLM lanes (N=24-26): GLM-5 at 36 tok/s, the flat GLM-5.2 runner a sluggish 14.5. Wallclock is a trap on the unreliable lanes: Codex GPT-5.4's 13-second median is not speed, it is failures exiting early, since most of its cells never finished. The honest takeaway on this axis is narrow: where it was measured, the harnessed open lanes are not slower than the frontier, and the flat GLM runner is both the weakest and the slowest.

The point for anyone choosing a model: the cheapest lane that reaches ~4.4 on real code review is not a Claude lane. If your workload is review and design rather than the hardest agentic builds, a harnessed open model can land in the same quality cluster at a metered cost you can actually see and control.

The caveats

The caveats are not a disclaimer, they are the credibility. A benchmark that hides its failure modes is worse than no benchmark.

Small n. Tiers run two to three replications. This is a high-signal, low-volume design built to discriminate, not to produce confidence intervals. Read the per-rep spread before turning a cell into a ranking. The GLM-5.2 runner bimodality is a good reminder that variance is itself a finding.
The judge is an LLM, for 10% of the score. Only code_quality is model-judged, by an Opus-plus-Kimi panel that averages the two and flags disagreement above 1.5. The other 90% of the composite is deterministic verify.py.
Codex zeros are non-completions, not a review verdict. A blend of launch failures and, for GPT-5.4, a genuine capability wall on the harder tiers. The launch table is published so the split is visible. Repeated here because it is the easiest number to misquote.
Cost zeros are unmeasured, not free. Subscription lanes (Claude, Kimi) and the GLM-5.2-harness capture-gap all read $0 for different reasons. Do not compare cost across metered and unmetered lanes as like-for-like.
One task carries a known artifact. The SSRF async-fetch task has a cross-lane filename-convergence effect, so I treat that single column as lower-confidence wherever it appears. [^ssrf]
T6 is two private codebases, more than 300K lines of code combined. Only aggregate scores are published, no code and no project names, labelled real-review A and B. [^t6]
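
The judge-panel rule from the caveats above (average the two judges, flag disagreement above 1.5) is simple enough to state in code. This is a sketch with hypothetical names, not the panel's implementation; it treats a gap of exactly 1.5 as not flagged, following the word "above".

```python
def panel_score(judge_a: float, judge_b: float, max_gap: float = 1.5) -> dict:
    """Average two judge scores and flag the cell when they disagree too much."""
    gap = abs(judge_a - judge_b)
    return {
        "code_quality": (judge_a + judge_b) / 2,
        "gap": gap,
        "flagged": gap > max_gap,   # strictly above the threshold
    }
```

A flagged cell still gets its averaged score; the flag only marks it for a human look before the number is quoted.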

[^codex]: Codex 0.00 cells are non-completions. GPT-5.5 launched ~73% of attempts but completed ~36% (41 of 114); GPT-5.4 also hit a capability wall on T2-T5 (27 capability-fails). When a run finished it scored ~4.4 on review. See the launch table for the full split.
[^cost]: Subscription lanes (Claude, Kimi-via-CLI) report no per-call usage, so cost reads 0. That is "not measured," never "free."
[^ssrf]: The 08_ssrf task has a known cross-lane filename-convergence effect and is treated as lower-confidence.
[^t6]: t6 task definitions live in a gitignored overlay and are never published. Only anonymized aggregate scores are shared.

What I would tell a peer building this

A few things generalize beyond my setup.

Bridge, do not rebuild. If you already trust one agentic harness, put a local protocol-translating proxy in front of your other models instead of maintaining a second harness. One code path to harden is worth a lot.

Treat the execution layer as the system under test. The models are fine. The spawns, the streams, the worktrees and the credit checks are where the bugs live. Drain every pipe. Isolate outside the repo. Fail loud on a broken invariant.

Separate infrastructure failure from incapacity, in code. Wall-clock, exit code and error class are enough to bucket "re-run this" versus "record this as a real zero." Without that split, quota noise and model weakness look identical.

Keep a glass-box trail. Every dispatch writes a structured report and a receipt with the git ref it ran against. When a number looks wrong at 02:00, the audit trail is the difference between a five-minute check and a re-run of the whole matrix.

Why a fair benchmark costs a week

This benchmark took almost a week, and that is the point, not an apology. The scoring was never the hard part. Getting five model families to run inside one harness fairly, so a low score means the model could not do the work and never that the plumbing failed, is what consumed the days. A weekend benchmark measures whichever model your plumbing happened to favor. A fair one costs a week, and the week is most of the value.

My proof for that claim is not a screenshot. It is 2,800+ receipts across 1,500+ analyzed sessions in a governance-first runtime, every dispatch traceable to the git ref it ran against.

The harness, the lanes and the proxy bridge are open source. The full methodology, with every fairness mechanism cited down to the file:line, is in scripts/benchmark/field-tests/METHODOLOGY.md in Vinix24/vnx-orchestration. Every published number traces from raw.csv through scorer.py to a task's verify.py. If a figure here does not reproduce from those, treat it as wrong.

One follow-up from this work: the GLM-5.2 OpenRouter recipe for the cheapest route into a harness.

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.

