The Real Cost of AI Agents in Production (And How I Cut Mine by 87%)
Here's what nobody tells you about AI agents: building them is the cheap part.
I run 11 AI agents in production. Content generation, SEO analysis, lead qualification, code review — the full stack. In December 2025, my combined LLM bill hit $2,847. For a solopreneur operation. That number forced me to either shut agents down or fundamentally rethink how they consume tokens.
I chose the second option. Three months later, the same workload runs for $370/month.
This post is the full breakdown. Not theory — actual production numbers.
Here's the actual math. With ~2,400 dispatches per month:
Before optimization (everything on Opus):
| Routing | Dispatches | Cost |
|---|---|---|
| All tasks → Opus | 2,400 | ~$2,847/mo |
After multi-model routing:
| Model | % | Dispatches | Cost |
|---|---|---|---|
| Opus (architecture, reviews) | 10% | 240 | ~$180 |
| Sonnet (code, standard) | 50% | 1,200 | ~$135 |
| Codex (implementation) | 30% | 720 | ~$48 |
| Flash (classification) | 10% | 240 | ~$7 |
| Total | 100% | 2,400 | ~$370/mo |
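If you want to sanity-check that arithmetic yourself, here's a minimal Python sketch that recomputes the blended total from the table rows. The dispatch counts and costs are copied straight from the table; the per-dispatch figure is simply derived from them.

```python
# Sanity-check the routing table: per-tier dispatches and monthly cost, taken
# directly from the rows, plus the derived average cost per dispatch.

tiers = {
    # tier: (share of dispatches, dispatches/month, monthly cost in USD)
    "Opus":   (0.10, 240,  180.0),
    "Sonnet": (0.50, 1200, 135.0),
    "Codex":  (0.30, 720,  48.0),
    "Flash":  (0.10, 240,  7.0),
}

total_dispatches = sum(d for _, d, _ in tiers.values())
total_cost = sum(c for _, _, c in tiers.values())

for name, (share, dispatches, cost) in tiers.items():
    print(f"{name:>6}: {dispatches:>5} dispatches, ${cost:>6.2f}/mo, "
          f"${cost / dispatches:.4f} per dispatch")

print(f" Total: {total_dispatches} dispatches, ${total_cost:.2f}/mo")
# Compare with the all-Opus baseline of ~$2,847/mo from the first table.
print(f"Reduction vs. $2,847 baseline: {1 - total_cost / 2847:.0%}")
```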

The V8 dispatcher itself contributes to this: by reducing per-dispatch overhead from ~1,500 tokens (V7 template compilation) to ~200 tokens (native skill activation), the orchestration layer costs almost nothing.
The Uncomfortable Truth About AI Agent Costs
The FinOps Foundation reported that AI spending doubled in enterprise environments in 2025, yet only 63% of organizations actually track their AI spend. The rest are flying blind.
Here's what makes agent costs particularly nasty: they compound in ways you don't expect.
A single agent call isn't expensive. But agents retry. They chain. They stuff context windows with conversation history. They call tools that generate more tokens. My content generation agent, for example, was burning through 180K tokens per blog post — not because the output was long, but because of the reasoning chain, tool calls, and quality validation loops behind it.
Enterprise LLM spending hit $8.4 billion in the first half of 2025 alone. Nearly 40% of enterprises now spend over $250,000 annually on language models. And that's just inference — not counting development, fine-tuning, or the human time spent debugging agent behavior.
The real cost driver? Sending every request to your most capable (and most expensive) model.
My Cost Breakdown: Before Optimization
Here's what my 11 agents were costing me in December 2025, before any optimization:
| Agent Category | Model Used | Monthly Tokens | Monthly Cost |
|---|---|---|---|
| Content generation (3 agents) | Claude Opus | ~12M tokens | $1,080 |
| Code review & refactoring | Claude Opus | ~8M tokens | $720 |
| SEO analysis & crawling | Claude Sonnet | ~15M tokens | $450 |
| Lead qualification | Claude Sonnet | ~6M tokens | $180 |
| Email drafting | Claude Opus | ~3M tokens | $270 |
| Data extraction & formatting | Claude Sonnet | ~4M tokens | $120 |
| Monitoring & alerts | Claude Haiku | ~1M tokens | $27 |
| Total | | ~49M tokens | $2,847 |

The pattern was obvious once I looked at the data: 72% of my spend went to Opus calls. But when I audited what those Opus calls were actually doing, roughly 60% of them were tasks that didn't need Opus-level reasoning. Formatting outputs. Extracting structured data from clear inputs. Writing first drafts that would be edited anyway.
I was paying premium prices for commodity work.
The Three Levers That Cut Costs by 87%
Lever 1: Multi-Model Routing
This is where the biggest savings came from. The concept is straightforward — route each request to the cheapest model that can handle it at acceptable quality.
RouteLLM research from LMSYS showed this approach can achieve 95% of GPT-4 performance while only sending 26% of requests to the premium model. My results were similar.
Here's how I restructured the routing:
Tier 1 — Haiku ($0.25/$1.25 per 1M tokens): Formatting, classification, extraction from structured data, monitoring checks, simple summarization. About 45% of all requests.
Tier 2 — Sonnet ($3/$15 per 1M tokens): SEO analysis, lead qualification scoring, code review for standard patterns, email drafts, RAG-powered responses. About 40% of requests.
Tier 3 — Opus ($15/$75 per 1M tokens): Complex multi-step reasoning, architectural decisions, novel content creation, edge cases escalated from lower tiers. Only 15% of requests.
The key insight: this isn't about quality compromise. It's about matching capability to complexity. You don't need a senior architect to format JSON.
I wrote about the orchestration layer behind this in my post on multi-model orchestration from a single terminal. The routing logic itself is simple — a classifier that evaluates task complexity before dispatching.
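To make that concrete, here is a minimal sketch of what a complexity-based router can look like, using the Anthropic Python SDK. The model IDs, the tier labels, and the classifier prompt are illustrative assumptions, not my production dispatcher.

```python
# Routing sketch: a cheap classifier call decides which tier handles the task,
# then the task is dispatched to that tier's model.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

MODELS = {
    "haiku":  "claude-3-haiku-20240307",     # substitute current model IDs
    "sonnet": "claude-3-5-sonnet-20240620",
    "opus":   "claude-3-opus-20240229",
}

CLASSIFIER_PROMPT = (
    "Classify the following task into exactly one tier and reply with only that word.\n"
    "haiku: formatting, classification, extraction, simple summarization.\n"
    "sonnet: analysis, scoring, standard code review, drafting.\n"
    "opus: multi-step reasoning, architecture, novel creative work.\n\n"
    "Task: {task}"
)

def route(task: str) -> str:
    """Pick a tier with a cheap Haiku call; default to sonnet if unsure."""
    resp = client.messages.create(
        model=MODELS["haiku"],
        max_tokens=5,
        messages=[{"role": "user", "content": CLASSIFIER_PROMPT.format(task=task)}],
    )
    tier = resp.content[0].text.strip().lower()
    return tier if tier in MODELS else "sonnet"

def dispatch(task: str) -> str:
    """Send the task to the cheapest model the classifier considers sufficient."""
    tier = route(task)
    resp = client.messages.create(
        model=MODELS[tier],
        max_tokens=1024,
        messages=[{"role": "user", "content": task}],
    )
    return resp.content[0].text
```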
Lever 2: Quality Gates as Cost Controls
Here's the counterintuitive part: adding validation steps reduced total cost.
Before optimization, my agents would retry on failure with the same expensive model. A content agent that produced a subpar draft would regenerate the entire thing — another 40K tokens through Opus.
Now, async quality gates validate output at each step. If a Haiku-generated extraction has issues, it gets escalated to Sonnet — not retried at the same tier. If a Sonnet draft needs deeper reasoning on one section, only that section goes to Opus.
This cascading approach means failures are cheap. A bad Haiku output costs $0.002. Escalating it to Sonnet costs $0.05. The old approach — retrying the whole chain through Opus — cost $2.40 per retry.
My retry costs dropped from ~$580/month to ~$45/month.
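The pattern itself is simple. Here's a hedged sketch of the escalation loop, assuming a `call_model` dispatcher like the one sketched above and a task-specific `validate` function you'd supply yourself.

```python
# Escalation sketch: on a failed quality check, move one tier up instead of
# retrying at the same tier. `call_model` and `validate` are assumed helpers;
# validate() is task-specific (schema check, length check, heuristics, etc.).
from typing import Callable

TIER_ORDER = ["haiku", "sonnet", "opus"]

def run_with_escalation(
    task: str,
    call_model: Callable[[str, str], str],   # (tier, task) -> output
    validate: Callable[[str], bool],          # output -> passed?
    start_tier: str = "haiku",
) -> str:
    """Run the task at a cheap tier, escalating only when validation fails."""
    start = TIER_ORDER.index(start_tier)
    output = ""
    for tier in TIER_ORDER[start:]:
        output = call_model(tier, task)
        if validate(output):
            return output
    # Even the top tier failed validation; return the last attempt and let a
    # human (or a downstream gate) decide what to do with it.
    return output
```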
Lever 3: Context Window Management
Token usage isn't just about which model you call. It's about how much you send per call.
Three changes made the biggest difference:
- Aggressive context pruning. My agents were stuffing full conversation histories into every call. Now they send a compressed summary plus only the last 2 relevant turns. This alone cut average tokens-per-call by 40% (sketched after this list).
- Output schema enforcement. Instead of letting models generate verbose explanations, I enforce structured output schemas. The model returns JSON, not prose. Shorter outputs = fewer output tokens = lower cost (output tokens are 3-5x more expensive than input tokens).
- Semantic caching. For repeated or near-identical queries (common in SEO analysis and data extraction), I cache responses and serve them without an API call. Cache hit rate: ~23% (also sketched below).
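For the curious, here is a minimal sketch of the pruning and caching pieces. The `summarize` helper is assumed (it can itself be a cheap Haiku call), and the cache keys on a normalized string hash, which is a simplification of true semantic caching (a production version would compare embeddings rather than exact strings).

```python
# Sketch of two of the three changes: context pruning (summary + last 2 turns)
# and a cache for repeated or near-identical queries.
import hashlib

def prune_context(history: list[dict], summarize) -> list[dict]:
    """Replace the full conversation history with a summary plus the last 2 turns."""
    if len(history) <= 2:
        return history
    summary = summarize(history[:-2])  # compress everything except the tail
    return [{"role": "user", "content": f"Conversation summary: {summary}"}] + history[-2:]

class QueryCache:
    """Serve repeated queries without an API call (normalized-hash key)."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def _key(self, query: str) -> str:
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, query: str) -> str | None:
        return self._store.get(self._key(query))

    def put(self, query: str, response: str) -> None:
        self._store[self._key(query)] = response
```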
I documented the token reduction patterns in detail in my post on token usage reduction.
The Model Routing Decision Tree
This is the actual decision logic my orchestration layer uses:
```text
1. Is the task structured data transformation?
   → YES: Haiku
   → NO: Continue
2. Does the task require multi-step reasoning?
   → NO: Does it need current context understanding?
      → NO: Haiku
      → YES: Sonnet
   → YES: Continue
3. Is the task novel/creative or architecturally complex?
   → NO: Sonnet
   → YES: Opus
4. Post-execution quality check:
   → PASS: Done
   → FAIL: Escalate one tier up, reprocess only the failed section
```

The classifier that evaluates these conditions is itself a Haiku call — costing fractions of a cent per routing decision. The routing overhead is negligible compared to the savings.
What matters most: the escalation path. When a lower-tier model fails, you don't restart the entire chain. You escalate the specific failed component. This is what I mean when I say architecture beats models — the system design determines your cost structure more than your model choice does.
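If you'd rather read the same tree as code, here's a sketch. The task attributes are assumed flags; in practice the Haiku classifier mentioned above (or your own heuristics) would set them before dispatch.

```python
# The decision tree above, written out as plain routing logic. The boolean
# attributes are assumed to be tagged upstream by a cheap classifier.
from dataclasses import dataclass

@dataclass
class Task:
    is_structured_transform: bool    # e.g. "reformat this CSV as JSON"
    needs_multi_step_reasoning: bool
    needs_context_understanding: bool
    is_novel_or_architectural: bool

def pick_tier(task: Task) -> str:
    # 1. Structured data transformation goes straight to the cheapest tier.
    if task.is_structured_transform:
        return "haiku"
    # 2. No multi-step reasoning: Haiku unless it needs context understanding.
    if not task.needs_multi_step_reasoning:
        return "sonnet" if task.needs_context_understanding else "haiku"
    # 3. Multi-step reasoning: Opus only for novel/architectural work.
    return "opus" if task.is_novel_or_architectural else "sonnet"

# Step 4 (the post-execution quality check) lives outside this function:
# on failure, escalate one tier up and reprocess only the failed section.
```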
What This Means for Your Architecture
After optimization, my cost breakdown looks like this:
| Agent Category | Routing Strategy | Monthly Cost |
|---|---|---|
| Content generation | Haiku draft → Sonnet edit → Opus for key sections | $185 |
| Code review & refactoring | Sonnet primary, Opus escalation | $95 |
| SEO analysis & crawling | Haiku extraction, Sonnet analysis | $38 |
| Lead qualification | Haiku scoring, Sonnet edge cases | $18 |
| Email drafting | Sonnet primary | $14 |
| Data extraction & formatting | Haiku with schema enforcement | $8 |
| Monitoring & alerts | Haiku + caching | $12 |
| Total | | $370 |
That's an 87% reduction. Same 11 agents. Same output quality — validated through blind comparison tests where I couldn't distinguish the optimized output from the all-Opus output in 94% of cases.
If you're building agents for production, here's what I'd prioritize:
- Instrument everything. You can't optimize what you don't measure. Log token usage per agent, per task type, per model. The patterns will be obvious once you have data (a minimal logging sketch follows this list).
- Start with routing. Multi-model routing delivers the biggest ROI with the least architectural change. Even a simple "is this task complex?" classifier saves 40-60% immediately.
- Add quality gates before adding models. A cheap model with good validation outperforms an expensive model with no checks. Every time.
- Treat output tokens as premium. They cost 3-5x more than input tokens. Enforce structured outputs. Cut the prose.
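For the first item, here's a minimal logging sketch. The price table mirrors the per-1M-token prices quoted earlier; the CSV destination is an assumption you'd replace with your own storage.

```python
# Instrumentation sketch: log tokens and cost per agent and per task type.
import csv
import time
from pathlib import Path

PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "haiku":  (0.25, 1.25),
    "sonnet": (3.00, 15.00),
    "opus":   (15.00, 75.00),
}

LOG_FILE = Path("llm_usage.csv")  # assumed destination; could be a DB table

def log_call(agent: str, task_type: str, model: str,
             input_tokens: int, output_tokens: int) -> float:
    """Append one usage record and return the cost of the call in USD."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    new_file = not LOG_FILE.exists()
    with LOG_FILE.open("a", newline="") as f:
        writer = csv.writer(f)
        if new_file:
            writer.writerow(["ts", "agent", "task_type", "model",
                             "input_tokens", "output_tokens", "cost_usd"])
        writer.writerow([int(time.time()), agent, task_type, model,
                         input_tokens, output_tokens, round(cost, 6)])
    return cost
```

Aggregate this per agent and per task type, not just per month: the per-task breakdown is where the routing opportunities show up.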
For a deeper look at how I structure these systems, check my AI architecture service page — or read the full series on building production AI systems.
Read also: Architecture Beats Models: Lessons from AI Agent Dispatches — why system design matters more than model selection.
Frequently Asked Questions
Does multi-model routing actually maintain quality? In my blind testing, 94% of outputs from the optimized routing pipeline were indistinguishable from the all-Opus baseline. The 6% that differed were edge cases in creative content — and even those were acceptable after one Opus escalation pass.
What's the minimum number of agents where routing makes sense? Even with a single agent, routing pays off if that agent handles diverse task types. The break-even point for building a routing layer is roughly $200/month in LLM spend. Below that, the engineering time isn't worth it.
How do you handle latency with multi-model routing? Smaller models are actually faster. Haiku responds in 200-400ms vs. 2-4 seconds for Opus. My average response time improved by 60% after routing, because most requests now hit faster models.
What about fine-tuning instead of routing? Fine-tuning locks you into one model provider and requires ongoing maintenance as your use cases evolve. Routing is model-agnostic and adapts instantly when you add new models or pricing changes. I find routing more practical for operations under 50 agents.
Can I implement this with open-source models instead of API calls? Absolutely. The routing logic is model-agnostic. Replace Haiku/Sonnet/Opus with Qwen-0.5B/Qwen-7B/Qwen-72B on your own infrastructure and the same principles apply. Your cost savings shift from API fees to GPU compute, but the routing architecture stays identical.
What tools do you use for monitoring LLM costs? I built a custom dashboard that pulls token counts from API logs and calculates costs per agent per day. For teams just starting out, tools like LangSmith, Portkey, or Helicone give you this visibility out of the box. The important thing is that you track per-agent, per-task-type — aggregate numbers hide the real optimization opportunities.
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.