How I Cut My Claude Code Token Usage by 87% Without Losing Quality

Running four AI agent terminals daily sounds ambitious. It is. But what surprised me wasn't the complexity: it was the cost. By January 2026, I was spending about €2,400 per month on Claude API calls for the VNX orchestration system. Most of that wasn't going to brilliant reasoning or creative problem-solving. It was going to boilerplate.

That's when I realized something: I was treating token efficiency like an optimization detail. I should have been treating it like architecture.

The Expensive Problem: V7 Dispatcher

Let me paint the picture honestly. In the old system (V7), every single dispatch (and we're talking about 2,472 dispatches in February alone) followed the same pattern:

  1. T0 reads entire state files from .vnx-data/ (previous dispatches, receipts, logs)
  2. Template compilation happens: merging state into a massive Jinja2 prompt
  3. The compiled prompt gets sent to Claude with full context
  4. Claude responds with a dispatch plan
  5. Repeat for every task

A typical V7 dispatch prompt looked like this:

[Full state context - 500 tokens]
[Complete template with all variables - 400 tokens]
[Task description - 100 tokens]
[Previous dispatch examples - 300 tokens]
[System instructions - 200 tokens]
= 1,500+ tokens per dispatch

Multiply that by 2,472 monthly dispatches, and you're looking at about 3.7 million input tokens a month spent on dispatch prompts alone, all of it part of the €2,400 bill I was seeing.

But here's the thing that bothered me more than the cost: most of those tokens were noise. I was sending the full VNX system prompt every single time, even for routine tasks that didn't need it. I was including state context that wasn't relevant to the decision at hand. I was rebuilding the same patterns over and over.

The Insight: Native Skills as Native Logic

The breakthrough came from a simple question: What if the system didn't need to explain itself every time?

Claude Code has a feature I'd been underutilizing: native skill invocation via slash commands. When you type /skill-name, Claude Code loads that skill directly. No explanation needed. No template compilation. No context rebuild.

I thought: what if I reversed the dispatch model? Instead of:

"Send a 1,500-token prompt that explains everything"

I could do:

"Activate /skill-name and send the instruction"

The V8 dispatcher works like this:

  1. T0 makes a decision about which skill to invoke
  2. Instead of sending a full prompt, it sends: /skill-name via keyboard simulation
  3. Then it sends the instruction via paste-buffer (2-3 sentences)
  4. Claude Code loads the skill automatically
  5. The skill contains all the context and patterns it needs
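Here is a minimal sketch of what steps 2 and 3 could look like when the worker runs in a tmux pane. It assumes tmux is the transport; the pane target `vnx:0.1`, the skill name `code-review`, and the function names are hypothetical, not VNX's actual code. By default it only builds the commands, so nothing is sent unless `dry_run=False`.

```python
import subprocess


def build_dispatch_commands(pane: str, skill: str) -> list[list[str]]:
    """Return the tmux invocations for one dispatch, in order."""
    return [
        # Type the slash command into the pane and press Enter.
        ["tmux", "send-keys", "-t", pane, f"/{skill}", "Enter"],
        # Load the instruction into a paste buffer ("-" reads from stdin).
        ["tmux", "load-buffer", "-"],
        # Paste the buffer into the pane, then submit it.
        ["tmux", "paste-buffer", "-t", pane],
        ["tmux", "send-keys", "-t", pane, "Enter"],
    ]


def dispatch(pane: str, skill: str, instruction: str, dry_run: bool = True) -> list[list[str]]:
    """Build the dispatch commands and, unless dry_run, execute them."""
    commands = build_dispatch_commands(pane, skill)
    if not dry_run:
        for argv in commands:
            # Only load-buffer needs the instruction text on stdin.
            stdin = instruction if argv[1] == "load-buffer" else None
            subprocess.run(argv, input=stdin, text=True, check=True)
    return commands


if __name__ == "__main__":
    for argv in dispatch("vnx:0.1", "code-review", "Review the latest receipt."):
        print(" ".join(argv))
```

The point of the split is that the slash command and the instruction travel separately: the command is a handful of keystrokes, and the instruction is the only free text that costs tokens.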

Total tokens? 200. Per dispatch. Not 1,500.

That's an 87% reduction.

This dispatch model has since matured into a uniform PREPARE phase used across all lanes: one assembled instruction consisting of the skill body, a permission preamble scoping which tools the worker actually needs, relevant intelligence from the FTS5 database, and a report-contract directive telling the worker exactly what structure its output must follow. Same principle -- don't repeat what the skill already contains -- but now applied uniformly whether the worker is Claude via tmux-spawn, Codex via headless subprocess, or Kimi via CLI.
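To make the four parts of that assembled instruction concrete, here is a rough sketch. The class name, field names, section labels, and ordering are assumptions for illustration only; the real PREPARE phase may structure this differently.

```python
from dataclasses import dataclass, field


@dataclass
class PreparedInstruction:
    """Hypothetical container for the four parts of one assembled instruction."""
    skill_body: str
    allowed_tools: list[str]
    intelligence: list[str] = field(default_factory=list)
    report_contract: str = ""

    def assemble(self) -> str:
        """Join the parts into one instruction, skipping empty optional sections."""
        sections = [
            self.skill_body.strip(),
            "Allowed tools: " + ", ".join(self.allowed_tools),
        ]
        if self.intelligence:
            bullets = "\n".join(f"- {item}" for item in self.intelligence)
            sections.append("Relevant intelligence:\n" + bullets)
        if self.report_contract:
            sections.append("Report contract: " + self.report_contract.strip())
        return "\n\n".join(sections)


if __name__ == "__main__":
    prepared = PreparedInstruction(
        skill_body="Run the test suite and summarize failures.",
        allowed_tools=["Bash", "Read", "Grep"],
        intelligence=["Pattern P-12: flaky network tests"],  # made-up example entry
        report_contract="Return status, evidence, and open questions.",
    )
    print(prepared.assemble())
```

Because every lane receives the same assembled string, the worker behind it (tmux pane, headless subprocess, or CLI) becomes an implementation detail.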

How the Dispatch Got Smaller

Let me break down where those tokens went:

Removed (V7 → V8):

  • Template compilation logic: -200 tokens (Jinja2 processing explanation)
  • Full system prompt: -400 tokens (no longer needed, skill has it)
  • State file context: -300 tokens (replaced with Progressive Intelligence)
  • Example patterns: -200 tokens (embedded in skills, not sent)
  • Variable substitution explanation: -150 tokens
  • Subtotal: -1,250 tokens

Added (V8 only):

  • Skill activation command: +5 tokens
  • Instruction/decision: +50 tokens
  • Relevant pattern references: +30 tokens (just the IDs, not full patterns)
  • Subtotal: +85 tokens

Net: 1,500 - 1,250 + 85 = 335 tokens on paper. Further trimming brought the measured average down to about 200 tokens per dispatch.
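The ledger is easy to check in a few lines. The numbers below are taken straight from the lists above; only the dictionary keys are my own shorthand.

```python
V7_TOKENS_PER_DISPATCH = 1_500

removed = {
    "template_compilation_logic": 200,
    "full_system_prompt": 400,
    "state_file_context": 300,
    "example_patterns": 200,
    "variable_substitution_explanation": 150,
}

added = {
    "skill_activation_command": 5,
    "instruction": 50,
    "pattern_references": 30,
}

removed_total = sum(removed.values())
added_total = sum(added.values())
net = V7_TOKENS_PER_DISPATCH - removed_total + added_total

# Reduction against the ~200-token figure quoted for V8 dispatches.
measured_v8 = 200
reduction_pct = round((1 - measured_v8 / V7_TOKENS_PER_DISPATCH) * 100)

print(removed_total, added_total, net, reduction_pct)  # 1250 85 335 87
```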

The key insight: the skill is the system prompt. It doesn't need to be explained in every dispatch. It lives in Claude Code's native environment.

Progressive Intelligence: Context Without the Overhead

But removing boilerplate only solves half the problem. T0 still needs intelligent context to make good decisions. That's where Progressive Intelligence comes in.

Instead of always sending full state, I built a 5-level context system:

Level 1 (Quick): 1K tokens

  • Last 10 dispatches
  • Status summary (3-4 lines)
  • Current blockers only

Level 2 (Standard): 3K tokens

  • Last 25 dispatches
  • Full receipt summaries
  • Quality metrics

Level 3 (Detailed): 5K tokens

  • Last 50 dispatches
  • Error logs
  • Pattern matches

Level 4 (Comprehensive): 10K tokens

  • Last 100 dispatches
  • Complete event history
  • Model selection data

Level 5 (Full): 20K+ tokens

  • Last 200 dispatches
  • Raw state files
  • Full audit trail

T0 automatically selects the level based on task complexity. Routine dispatch? Level 1. Debugging a cascade? Level 4. System redesign? Level 5.

The result: 80-95% token savings on context compared with always loading the full state, and better decisions because T0 isn't drowning in irrelevant data.
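A sketch of how level selection could be expressed as a lookup table. The budgets and dispatch counts come from the five levels above; the task-kind names, the default of Level 2 for unknown tasks, and the function name are assumptions, since the post doesn't describe how T0 actually classifies complexity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContextLevel:
    level: int
    name: str
    token_budget: int
    last_dispatches: int


LEVELS = {
    1: ContextLevel(1, "Quick", 1_000, 10),
    2: ContextLevel(2, "Standard", 3_000, 25),
    3: ContextLevel(3, "Detailed", 5_000, 50),
    4: ContextLevel(4, "Comprehensive", 10_000, 100),
    5: ContextLevel(5, "Full", 20_000, 200),
}

# Hypothetical task kinds, mapped to the three examples given in the text.
TASK_LEVELS = {
    "routine_dispatch": 1,
    "cascade_debug": 4,
    "system_redesign": 5,
}


def select_level(task_kind: str, default: int = 2) -> ContextLevel:
    """Pick a context level for a task; unknown kinds fall back to `default`."""
    return LEVELS[TASK_LEVELS.get(task_kind, default)]


def savings_vs_full(level: ContextLevel) -> float:
    """Fraction of context tokens saved compared with the Level 5 budget."""
    return 1 - level.token_budget / LEVELS[5].token_budget


if __name__ == "__main__":
    chosen = select_level("routine_dispatch")
    print(chosen.name, f"{savings_vs_full(chosen):.0%}")  # Quick 95%
```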

Context Rotation: Preventing Token Rot

Here's a problem I didn't expect: Claude's context window, even at 200K tokens, can degrade. Not from hard limits, but from dilution. When you keep adding to the same context, older decisions become less relevant. The system starts to "forget" patterns that are actually important.

I call this context rot.

To fix it, V8 implements automatic context rotation hooks. When the system detects context pressure (measured by a combination of dispatch count, token density, and pattern redundancy), it:

  1. Writes a handover summary to .vnx-data/session-rotation/
  2. Generates a new session with fresh context
  3. Links the sessions so T0 can still access decisions from the previous session

This sounds expensive, but it's not. A rotation saves 20-30K tokens on the next dispatch compared to working with degraded context.
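A rough sketch of what a rotation hook could look like. The three signals (dispatch count, token density, pattern redundancy) are the ones named above; the weights, limits, threshold, and handover file format are all invented for illustration and would need tuning against real sessions.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class SessionStats:
    session_id: str
    dispatch_count: int
    context_tokens: int
    redundant_patterns: int
    total_patterns: int


def context_pressure(stats: SessionStats,
                     max_dispatches: int = 150,
                     max_tokens: int = 200_000) -> float:
    """Combine the three signals into a 0..1 score (weights are assumptions)."""
    dispatch_load = min(stats.dispatch_count / max_dispatches, 1.0)
    token_density = min(stats.context_tokens / max_tokens, 1.0)
    redundancy = (stats.redundant_patterns / stats.total_patterns
                  if stats.total_patterns else 0.0)
    return 0.4 * dispatch_load + 0.4 * token_density + 0.2 * redundancy


def rotate_if_needed(stats: SessionStats, summary: str,
                     rotation_dir: Path, threshold: float = 0.7) -> Optional[Path]:
    """Write a handover file when pressure crosses the threshold, else do nothing."""
    if context_pressure(stats) < threshold:
        return None
    rotation_dir.mkdir(parents=True, exist_ok=True)
    handover = rotation_dir / f"{stats.session_id}-handover.json"
    # Recording the previous session id is what links old and new sessions.
    handover.write_text(json.dumps(
        {"previous_session": stats.session_id, "summary": summary}, indent=2))
    return handover
```

In use, the orchestrator would call `rotate_if_needed(stats, summary, Path(".vnx-data/session-rotation"))` after each dispatch and start a fresh session whenever a path comes back.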

The Numbers: What 87% Actually Means

Let me show you the real math. February 2026 baseline (V7):

  • Dispatches: 2,472
  • Tokens per dispatch: 1,502 (measured average)
  • Total dispatch tokens: 3,712,944
  • Dispatch input cost (at $0.003/1K input tokens): $11.14
  • Total monthly bill: €2,398.80

March 2026 with V8:

  • Dispatches: 2,506 (slightly higher, system runs more efficiently)
  • Tokens per dispatch: 198 (measured average)
  • Total dispatch tokens: 496,188
  • Dispatch input cost (at $0.003/1K input tokens): $1.49
  • Total monthly bill: €321.92

Reduction: 1,304 tokens per dispatch (87%). €2,076.88 per month (87%).
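For anyone who wants to rerun the arithmetic, here is a small script that recomputes the totals from the measured averages and the quoted rate of $0.003 per 1K input tokens. The function name is mine; the inputs are the figures listed above.

```python
def monthly_dispatch_tokens(dispatches: int, tokens_per_dispatch: int,
                            usd_per_1k: float = 0.003) -> tuple[int, float]:
    """Return (total tokens, input cost in USD) for one month of dispatches."""
    total = dispatches * tokens_per_dispatch
    return total, total / 1000 * usd_per_1k


v7_tokens, v7_cost = monthly_dispatch_tokens(2_472, 1_502)
v8_tokens, v8_cost = monthly_dispatch_tokens(2_506, 198)

per_dispatch_cut = 1 - 198 / 1_502      # token reduction per dispatch
bill_cut = 1 - 321.92 / 2_398.80        # reduction in the monthly bill

print(v7_tokens, round(v7_cost, 2))     # 3712944 11.14
print(v8_tokens, round(v8_cost, 2))     # 496188 1.49
print(round(per_dispatch_cut * 100), round(bill_cut * 100))  # 87 87
```

Both the per-dispatch token count and the monthly bill land at roughly 87%, which is why the headline figure holds whichever way you measure it.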

But the story doesn't end with cost. Quality improved too. Why?

  1. Less noise = better decisions. T0 isn't sifting through template explanations to find signal.
  2. Progressive context = relevant information. Level 1 context is usually all T0 needs.
  3. Skill-native logic = consistency. Skills are version-controlled. They don't drift.
  4. Faster dispatch cycles. What used to take 45 seconds now takes 8 seconds.

By March 15, the dispatch quality metric (the share of dispatches that execute successfully without human intervention) had improved from 87% to 94%.

Model Selection: Not All Tokens Are Equal

One more thing worth mentioning: token optimization also depends on which model you're using.

VNX model strategy:

  • T0 (Orchestrator): Claude Opus 4.6, complexity justifies the cost
  • T1/T2 (Implementation/Testing): Claude Sonnet 3.7, speed and accuracy balance
  • T3 (Review): Claude Opus 4.6, quality gate requires the best model

I used to run everything on Opus. After progressive intelligence and better dispatch design, I could move T1 and T2 to Sonnet without degrading output quality. The per-token cost is lower, and execution is faster.

Combined with the V8 dispatcher changes, this accounts for another 15% cost reduction that doesn't show up in the "tokens per dispatch" number -- it's pure model selection efficiency.

There's a subtler token saver that arrived after this post: capability scoping. Early versions used --dangerously-skip-permissions to avoid permission prompts blocking automation. The current approach uses explicit --allowedTools allowlists with ambient MCP off. The worker gets exactly the tools it needs for its task -- Bash, Read, Write, Edit, Grep, Glob -- and nothing more. This matters for tokens because every tool description in the system prompt costs input tokens. An allowlist of 6 tools sends less system-prompt overhead than the full tool suite. Small saving per dispatch, compounding across thousands.
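As an illustration, a scoped worker invocation might be built like this. The `claude` binary, the `-p` flag, and `--allowedTools` are taken from the Claude Code CLI as I understand it, but passing the allowlist as one comma-separated value is an assumption; check the CLI reference for the exact argument format before relying on it. The function name and example instruction are hypothetical, and the sketch only builds the command, it does not run it.

```python
import shlex

# The six tools named in the text; a task-specific subset would be even tighter.
WORKER_TOOLS = ["Bash", "Read", "Write", "Edit", "Grep", "Glob"]


def build_worker_command(instruction: str,
                         allowed_tools: list[str] = WORKER_TOOLS) -> list[str]:
    """Build an argv for a headless worker limited to an explicit tool allowlist."""
    if not allowed_tools:
        raise ValueError("refusing to build a command with an empty allowlist")
    return [
        "claude",
        "-p", instruction,                          # non-interactive prompt
        "--allowedTools", ",".join(allowed_tools),  # assumed comma-separated form
    ]


if __name__ == "__main__":
    argv = build_worker_command("Run the unit tests and report failures.")
    print(shlex.join(argv))
```

Refusing an empty allowlist is a deliberate choice in this sketch: a worker with no tools is almost certainly a configuration mistake, and failing loudly is cheaper than a wasted dispatch.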

What I'd Do Differently

If I could restart knowing what I know now:

  1. Don't build monolithic prompts. From day one, I should have modeled the system as distributed skills, not centralized templates. That's an architectural decision, not an optimization.

  2. Measure context relevance, not just size. I wasted months optimizing context size when I should have been optimizing relevance. A 1K-token Level 1 context is worth more than a 5K-token context diluted with irrelevant state.

  3. Test model selection early. I assumed everything needed Opus. Testing Sonnet on T1/T2 two months in would have saved money earlier.

  4. Build rotation into the system from the start. Context rot sneaks up on you. I only noticed it after 6 weeks. Native session rotation hooks should be baked in from day one.

  5. Make skills first-class citizens. Skills shouldn't be an afterthought or a nice-to-have. They should be the primary unit of system design. Everything flows from skills.

The Bigger Picture

Token efficiency in AI agent systems isn't just about cost, though that matters. It's about signal-to-noise ratio, decision quality, and system reliability.

When your dispatches are bloated with boilerplate, you're not just paying for wasted tokens. You're paying for:

  • Slower response times
  • Lower-quality context selection
  • Harder debugging (noise obscures problems)
  • Architectural brittleness (changes break templates)

By treating token efficiency as a first-class design constraint, I ended up with a cleaner, faster, more maintainable system. The cost reduction was a side effect of better architecture.

The VNX system is now built around this principle. Every dispatch is tight. Every token earns its place. And T0 can run 2,500+ dispatches per month for under €400 instead of €2,400.

If you're building AI agent systems at scale, this is worth thinking about. Your current token usage is probably telling you something about your architecture.

Update: June 2026

The 87% token reduction still holds, and the dispatch model has matured further. The V8 skill-activation pattern evolved into a uniform PREPARE phase across all lanes: one instruction assembled from the skill body, a permission preamble with explicit --allowedTools (not --dangerously-skip-permissions), intelligence context from the FTS5 database, and a report-contract directive. VNX reached 1.0 code-freeze with this dispatch model at its core. The lesson stands: native skills beat template compilation, and treating token efficiency as architecture (not optimization) pays compound interest.

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.
