30 days. 30 blog posts. [TOTAL_DISPATCHES] dispatches. Here is what actually happened when I ran a multi-agent production system at full capacity for an entire month: the numbers nobody publishes, because they make AI look less magical than the marketing suggests.
This is the first monthly build log for VNX Orchestration. Every number comes from production logs. Nothing is rounded to sound impressive. Where something failed, I say what failed and why.
If you run AI agents in production — or plan to — this is the data I wish someone had published before I started.
The Numbers
Here is the raw summary for April 2026.
| Metric | Value |
|---|---|
| Total dispatches | [TOTAL_DISPATCHES] |
| Successful dispatches | [SUCCESSFUL_DISPATCHES] |
| Failed dispatches | [FAILED_DISPATCHES] |
| First-pass yield | [FIRST_PASS_YIELD]% |
| Total tokens consumed | [TOTAL_TOKENS] |
| Average tokens per dispatch | [AVG_TOKENS_PER_DISPATCH] |
| Average quality score | [AVG_QUALITY_SCORE]/10 |
| Median dispatch duration | [MEDIAN_DISPATCH_DURATION] |
| Total governance events | [TOTAL_GOVERNANCE_EVENTS] |
[COMMENTARY_ON_NUMBERS — 2-3 sentences interpreting the key takeaways from the table. What surprised you? What confirmed your expectations?]
The first-pass yield is the metric I care about most. It measures how many dispatches passed all quality gates without requiring a rework cycle. In manufacturing, anything above 90% is considered world-class. For AI agents, I do not have an industry benchmark yet — which is part of why I am publishing this.
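For clarity on how that number is produced, here is a minimal sketch of the calculation in Python. The field names are illustrative assumptions, not the actual VNX dispatch schema:

```python
from dataclasses import dataclass

@dataclass
class Dispatch:
    # Illustrative fields; the real VNX dispatch records differ.
    passed_all_gates: bool  # True if no quality gate flagged the dispatch
    rework_cycles: int      # 0 means it shipped on the first attempt

def first_pass_yield(dispatches: list[Dispatch]) -> float:
    """Share of dispatches that cleared every quality gate without
    a single rework cycle, as a percentage."""
    if not dispatches:
        return 0.0
    first_pass = sum(
        1 for d in dispatches if d.passed_all_gates and d.rework_cycles == 0
    )
    return 100.0 * first_pass / len(dispatches)
```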
Content Output
April was the first month where I committed to publishing a blog post every single day. Here is what the content pipeline produced:
| Content type | Count | Notes |
|---|---|---|
| Blog posts (published) | 30 | Mix of Dutch (MKB market) and English (authority) |
| LinkedIn posts | [LINKEDIN_POST_COUNT] | [LINKEDIN_NOTES] |
| Intelligence reports | [INTELLIGENCE_REPORT_COUNT] | Daily briefings from HN/Reddit/GitHub |
| Email sequences drafted | [EMAIL_SEQUENCE_COUNT] | [EMAIL_NOTES] |
| Cover images generated | [COVER_IMAGE_COUNT] | DALL-E + HTML overlay pipeline |
[CONTENT_OUTPUT_COMMENTARY — What was the hardest part of daily publishing? What would you do differently? Did quality drop toward the end of the month?]
The daily publishing commitment forced a level of pipeline reliability that I would not have reached organically. When you know a post has to go out tomorrow, every manual step in the workflow becomes unacceptable friction.
Governance Events
The governance layer exists to catch problems before they reach production output. Here is what it caught in April.
Event Summary
| Event type | Count | Severity |
|---|---|---|
| Quality gate failures | [QUALITY_GATE_FAILURES] | [SEVERITY_DISTRIBUTION] |
| SPC alerts (3-sigma) | [SPC_ALERT_COUNT] | [SPC_SEVERITY] |
| Context rotation triggers | [CONTEXT_ROTATION_COUNT] | Info |
| Cost threshold warnings | [COST_THRESHOLD_COUNT] | Warning |
| Pattern confidence decays | [PATTERN_DECAY_COUNT] | Info |
| [OTHER_EVENT_TYPE] | [OTHER_EVENT_COUNT] | [OTHER_SEVERITY] |
[GOVERNANCE_COMMENTARY — Which event type was most common? Were there any patterns in when violations occurred (time of day, type of content, specific agent)? Any false positives worth mentioning?]
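For readers who have not met statistical process control outside a factory: a 3-sigma alert fires when a quality score drifts more than three standard deviations from a recent baseline. A rough sketch of that rule in Python, assuming a simple rolling window; the actual VNX windowing may differ:

```python
import statistics

def spc_alerts(scores: list[float], window: int = 20) -> list[int]:
    """Return the indices of scores that fall outside the 3-sigma
    control limits computed over the preceding `window` observations."""
    alerts = []
    for i in range(window, len(scores)):
        baseline = scores[i - window:i]
        mean = statistics.mean(baseline)
        sigma = statistics.stdev(baseline)
        if sigma > 0 and abs(scores[i] - mean) > 3 * sigma:
            alerts.append(i)
    return alerts
```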
Notable Incidents
Incident 1: [INCIDENT_1_TITLE]
- Date: [INCIDENT_1_DATE]
- What happened: [INCIDENT_1_DESCRIPTION]
- Root cause: [INCIDENT_1_ROOT_CAUSE]
- Resolution: [INCIDENT_1_RESOLUTION]
- Time to resolution: [INCIDENT_1_TTR]
Incident 2: [INCIDENT_2_TITLE]
- Date: [INCIDENT_2_DATE]
- What happened: [INCIDENT_2_DESCRIPTION]
- Root cause: [INCIDENT_2_ROOT_CAUSE]
- Resolution: [INCIDENT_2_RESOLUTION]
- Time to resolution: [INCIDENT_2_TTR]
[ADD_MORE_INCIDENTS_AS_NEEDED]
Every incident above is reconstructed from the append-only NDJSON audit trail. If you want to understand why I chose that format over a database, I wrote about it in an earlier post about the real cost of running AI agents in production.
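For context on what that reconstruction looks like: NDJSON is one JSON object per line, appended and never mutated, so replaying an incident is a linear scan and a filter. A sketch in Python; the `dispatch_id` field name is an assumption, not the real VNX event schema:

```python
import json
from pathlib import Path

def events_for_dispatch(trail: Path, dispatch_id: str) -> list[dict]:
    """Replay the append-only NDJSON audit trail and collect every
    event tied to one dispatch, in write order."""
    events = []
    with trail.open() as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            event = json.loads(line)
            if event.get("dispatch_id") == dispatch_id:  # assumed field name
                events.append(event)
    return events
```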
What Worked
Four things delivered disproportionate value this month.
1. Context Rotation
[CONTEXT_ROTATION_DETAILS — How many rotations happened? What was the average context utilization before rotation? Did rotation frequency change over the month?]
Context rotation remains the single highest-ROI investment in the system. Without it, quality degrades predictably after [X] dispatches as the context window fills with stale conversation history. The rotation mechanism resets the agent with a fresh context while preserving the intelligence layer — patterns, confidence scores, and prevention rules carry over. The agent forgets the conversation but keeps the lessons.
I wrote about the architecture behind this in how the VNX intelligence system works.
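To make the split between volatile and durable state concrete, here is a toy sketch of the rotation idea in Python. The class names and fields are illustrative assumptions; the real mechanism lives in the repo:

```python
from dataclasses import dataclass, field

@dataclass
class IntelligenceLayer:
    # Durable state that survives rotation (illustrative fields).
    patterns: dict = field(default_factory=dict)
    confidence_scores: dict = field(default_factory=dict)
    prevention_rules: list = field(default_factory=list)

@dataclass
class AgentSession:
    intelligence: IntelligenceLayer
    history: list = field(default_factory=list)  # volatile conversation

def rotate(session: AgentSession) -> AgentSession:
    """Fresh context, same lessons: drop the conversation history,
    carry the intelligence layer over unchanged."""
    return AgentSession(intelligence=session.intelligence)
```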
2. Quality Gates as Hard Stops
[QUALITY_GATE_DETAILS — How many dispatches were caught by quality gates? What was the most common failure reason? Example of a gate catch that prevented a real problem.]
Quality gates are not suggestions in my system. They are hard stops. If a dispatch scores below [MINIMUM_QUALITY_THRESHOLD] on any dimension, it does not proceed. It gets flagged, logged, and queued for rework. This is expensive — rework dispatches cost tokens — but it is cheaper than publishing garbage.
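As a sketch of those hard-stop semantics in Python, with a placeholder threshold and hypothetical field names:

```python
MINIMUM_QUALITY_THRESHOLD = 7.0  # placeholder, not the real configured value

def enforce_gates(scores: dict[str, float], rework_queue: list[dict]) -> bool:
    """Hard stop: a single dimension below threshold blocks the
    dispatch and sends it to the rework queue."""
    failing = {dim: s for dim, s in scores.items()
               if s < MINIMUM_QUALITY_THRESHOLD}
    if failing:
        rework_queue.append({"scores": scores, "failed_dimensions": failing})
        return False  # dispatch does not proceed
    return True
```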
3. [WHAT_WORKED_3_TITLE]
[WHAT_WORKED_3_DETAILS — Specific example, metrics, and why it mattered.]
4. [WHAT_WORKED_4_TITLE]
[WHAT_WORKED_4_DETAILS — Specific example, metrics, and why it mattered.]
What Failed
Not everything worked. Here are the failures worth documenting.
1. [FAILURE_1_TITLE]
[FAILURE_1_DETAILS — What went wrong, when you noticed, what the impact was, and how you fixed it. Include specific metrics: how many dispatches were affected, what the quality impact was, how long it took to fix.]
2. [FAILURE_2_TITLE]
[FAILURE_2_DETAILS]
3. [FAILURE_3_TITLE]
[FAILURE_3_DETAILS]
[ADD_MORE_FAILURES_AS_NEEDED — Be honest. The value of a build log is in the failures, not the successes.]
Cost Breakdown
Here is what April cost to run.
| Cost category | Amount | % of total |
|---|---|---|
| Anthropic API (Claude) | $[ANTHROPIC_COST] | [ANTHROPIC_PCT]% |
| OpenAI API (DALL-E, embeddings) | $[OPENAI_COST] | [OPENAI_PCT]% |
| Infrastructure (Supabase, hosting) | $[INFRA_COST] | [INFRA_PCT]% |
| [OTHER_COST_CATEGORY] | $[OTHER_COST] | [OTHER_PCT]% |
| Total | $[TOTAL_COST] | 100% |
- Cost per published blog post: $[COST_PER_BLOG]
- Cost per dispatch: $[COST_PER_DISPATCH]
- Cost per 1K tokens: $[COST_PER_1K_TOKENS]
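The three unit costs are plain division over the monthly totals. A sketch, with made-up inputs rather than April's actual figures:

```python
def unit_costs(total_cost: float, blog_posts: int,
               dispatches: int, total_tokens: int) -> dict[str, float]:
    """Derive the three unit-cost figures from the monthly totals."""
    return {
        "per_blog_post": total_cost / blog_posts,
        "per_dispatch": total_cost / dispatches,
        "per_1k_tokens": 1000 * total_cost / total_tokens,
    }

# Example with invented numbers, not April's data:
print(unit_costs(total_cost=500.0, blog_posts=30,
                 dispatches=1200, total_tokens=40_000_000))
```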
[COST_COMMENTARY — How does this compare to March? To your expectations? Is the trend sustainable? Where is the biggest optimization opportunity?]
For context, I track these costs against the alternative: hiring a content team. A single full-time content writer in the Netherlands costs between EUR 3,000 and EUR 4,500 per month. A content strategist adds another EUR 4,000 to EUR 6,000 per month. My entire AI production system, including the intelligence layer, governance, and all API costs, runs for [COMPARISON_STATEMENT].
That comparison is not perfect. The AI system cannot do everything a human team can. But for structured, research-backed content production at this volume, the economics are not close.
Quality Trends
Quality did not stay flat across the month. Here is what happened.
Week 1 (Apr 1-7): [WEEK_1_AVG_QUALITY] average quality score. [WEEK_1_NOTES]
Week 2 (Apr 8-14): [WEEK_2_AVG_QUALITY] average quality score. [WEEK_2_NOTES]
Week 3 (Apr 15-21): [WEEK_3_AVG_QUALITY] average quality score. [WEEK_3_NOTES]
Week 4 (Apr 22-30): [WEEK_4_AVG_QUALITY] average quality score. [WEEK_4_NOTES]
[QUALITY_TREND_COMMENTARY — Did quality improve as the intelligence layer learned? Was there a mid-month dip? What drove the changes?]
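For reference, the weekly averages above are a straightforward aggregation: bucket each dispatch by calendar day, then average the quality scores per bucket. A sketch in Python, assuming each dispatch carries a date and a score:

```python
from collections import defaultdict
from datetime import date

def weekly_quality(dispatches: list[tuple[date, float]]) -> dict[int, float]:
    """Average quality score per April week, using the buckets above
    (days 1-7, 8-14, 15-21, and 22-30)."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for day, score in dispatches:
        week = min((day.day - 1) // 7 + 1, 4)  # days 29-30 fold into week 4
        buckets[week].append(score)
    return {w: sum(s) / len(s) for w, s in sorted(buckets.items())}
```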
What I Would Change for May
Three things I am changing based on April's data.
1. [CHANGE_1_TITLE]
[CHANGE_1_DETAILS — What is the problem, what is the planned fix, and what metric will tell you if it worked.]
2. [CHANGE_2_TITLE]
[CHANGE_2_DETAILS]
3. [CHANGE_3_TITLE]
[CHANGE_3_DETAILS]
The Meta Observation
There is something uncomfortable about publishing your production metrics every month. Every number is a surface someone can criticize. Every failure is documented proof that the system is imperfect.
I do it anyway because the alternative is worse.
The AI industry runs on vibes. "Our agents are amazing." "AI is transforming everything." "Results may vary." No numbers. No failure modes. No cost data. Just promises wrapped in demo videos.
I built VNX Orchestration to be the opposite of that. Open source. Production data published monthly. Failures documented alongside successes. If my system is going to claim 87% cost savings, you should be able to verify that claim against real dispatch data and real invoices.
This is the first monthly build log. May's will follow on May 31. If you run AI agents in production and want to compare notes, the GitHub repo has everything — including the scripts that generate these metrics.
[CLOSING_THOUGHT — One sentence reflection on what April taught you about running AI in production. Something specific, not generic.]
Read also: Intelligence Beats Memory: Why Your AI Agents Need a Self-Learning Pipeline — The architecture behind the intelligence layer referenced throughout this build log.
Read also: The Real Cost of AI Agents in Production — Detailed cost analysis and the pricing model that makes multi-agent systems economically viable.
Sources
- VNX Orchestration — GitHub (open source, production code)
- [ADDITIONAL_SOURCE_1]
- [ADDITIONAL_SOURCE_2]
- [ADDITIONAL_SOURCE_3 — Add any papers, articles, or tools referenced in the filled-in sections]
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.