GLM-5.2 Through the Claude Harness: The OpenRouter Route

GLM-5.2 looks mediocre at coding, if you call it the way most benchmarks do.

Run the exact same model through a real agentic tool-harness instead of a flat call and it jumps to frontier-Claude tier. Same weights, same OpenRouter endpoint, same tasks. The only thing that changed is the harness. If you are evaluating GLM-5.2, or any OpenRouter model, for coding, the harness is the variable you are probably getting wrong.

This is the focused recipe. For the full 14-lane benchmark across five model families and the lane-hardening story behind it, see one harness, five model families.

The one comparison this whole post is about

z-ai/glm-5.2 via OpenRouter, two ways. The flat runner is a thin read/write/run loop. The harness is the full claude CLI, pointed at GLM-5.2 through a local litellm proxy. Composite 0 to 5, median of the replications (N is one to two per cell, so read the direction, not the decimal).

TaskGLM-5.2 flat runnerGLM-5.2 via Claude harnessΔ
State-machine + SSE (complex)2.504.16+1.66
Path-sandbox / SSRF-style security2.593.92+1.33
Mock-introspection trap (debug)4.144.46+0.32
Planted-defect code review4.094.31+0.22
Agent-engine system design2.944.45+1.52
Real codebase review A (private, 300K+ LOC)3.024.36+1.33
Real codebase review B (private)1.744.37+2.62

Two stories in one table. The harness raises the score on the hard tasks, and it removes the variance. The flat runner's real-review-A result swung between 4.29 and 1.76 across reps (median 3.02); the harness held flat at 4.36. The biggest single lift is +2.62 on a real codebase review, exactly the read-run-fix work a flat one-shot call cannot do. A model you would have filed under "not good enough for real work" lands in the top cluster once it can read, run and fix.

The how-to: GLM-5.2 in the Claude Code CLI

Three steps. The whole trick is that the claude CLI talks to an Anthropic /v1/messages endpoint and lets you choose where, so you put a translating proxy in between.

1. Run a local litellm proxy that exposes an Anthropic-compatible /v1/messages endpoint, backed by GLM-5.2 on OpenRouter.

yaml
# litellm config: Anthropic-shaped front, OpenRouter GLM-5.2 back
model_list:
  - model_name: glm-5.2
    litellm_params:
      model: openrouter/z-ai/glm-5.2
      api_key: os.environ/OPENROUTER_API_KEY
litellm_settings:
  drop_params: true          # tolerate Anthropic params OpenRouter does not map 1:1
general_settings:
  master_key: sk-local-only  # a LOCAL bearer, not an OAuth subscription token
bash
litellm --config glm-proxy.yaml --port 4141

2. Point the claude CLI at the proxy. Use a local bearer token, not an OAuth subscription token.

bash
ANTHROPIC_BASE_URL=http://localhost:4141 \
ANTHROPIC_AUTH_TOKEN=sk-local-only \
claude -p "your task" --output-format stream-json

3. Drive your normal agentic loop. Inference flows OpenRouter to GLM-5.2; the harness is byte-identical to what Claude itself runs. Tools, file I/O, iteration, sub-agents, all of it.

The same recipe generalizes to any OpenRouter-served model. Swap the model_name entry and you have Qwen, Llama or anything else OpenRouter serves running inside the same loop. (In my own benchmark DeepSeek ran on its own key rather than OpenRouter, but the proxy pattern is identical.)

Why this is safe, and why the routing stays consistent

Two things make this more than a hack.

The routing stays consistent. GLM still goes through OpenRouter, with no direct z.ai or Zhipu account, so cost tracking and the model registry stay aligned with the rest of your stack.

The account stays safe. It is the CLI driving a subprocess, not the Anthropic SDK, and the proxy authenticates with a local key, not an OAuth subscription token. Anthropic has cracked down on subscription OAuth tokens used in third-party tools, including the Agent SDK, as a terms-of-service violation. A CLI-driven, local-key setup avoids that entire class of risk.

The counter-data

The harness is not free magic, and pretending otherwise would make this post hype instead of useful.

On a trivial yaml refactor the harness actually scored lower than the flat runner, 1.74 against 2.99. On one subtle-bugfix task the harness only reached 2.19. The lift is concentrated on tasks that need a read, test and fix loop. On a one-file mechanical edit there is nothing for the loop to add, and the extra machinery can even get in the way. If your workload is bulk mechanical edits, the harness is not your win. If it is real implementation, security boundaries, design and review, it is a large one.

Caveats

  • Small n. Two replications on the hard tasks. The variance is itself a finding, so read the spread before treating a cell as a ranking.
  • One review task runs against private production codebases. Only aggregate scores are published, no code and no names.
  • The proxy adds latency and a translation layer between the Anthropic tool format and OpenRouter. A model that handles tool-calling poorly will not benefit from the harness no matter what you put in front of it.
  • drop_params: true is doing real work. It tolerates Anthropic parameters OpenRouter does not map one to one. Without it, some calls fail at the proxy.

The proxy config, the runners and the scoring are open source in Vinix24/vnx-orchestration. The source-referenced methodology, with every fairness mechanism cited down to the file:line, lives in scripts/benchmark/field-tests/METHODOLOGY.md. Every number in the table above traces from raw.csv through scorer.py to a task's verify.py.

The takeaway is small and load-bearing. Before you conclude a model is weak at coding, check whether you measured the model or your harness. With GLM-5.2 the difference between those two is the difference between bottom tier and frontier tier.

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.

Reacties

Je e-mailadres wordt niet gepubliceerd. Reacties worden beoordeeld voor plaatsing.

Reacties laden...