Governance Scoring: How to Measure Whether Your AI Agent Deserves More Autonomy

Trust isn't binary. Here's how I measure it on a continuous scale.

Every agent framework talks about trust. Almost none of them measure it. They give you configuration toggles — "allow tool use: yes/no," "auto-approve: on/off" — and call that governance. But trust is not a switch. It is a signal that changes over time, based on observed behavior, and it should be quantified the same way you quantify any other production metric.

After running VNX Orchestration in production for over six months with 2,400+ agent dispatches, I built a scoring system that answers one question: does this agent, on this task type, deserve more autonomy than it currently has? The answer is never yes or no. It is a number between 0 and 1, and the number moves.

This is Part 14 of the Glass Box Governance series — the final piece. If the series started with philosophy ("why transparency matters") and moved through architecture ("how to build observable agents"), this post is the operational capstone: how to measure trust and use those measurements to make graduation decisions.

The Problem With Binary Trust

Most agent systems treat autonomy as a configuration decision. You decide at deployment time whether an agent can execute tasks unsupervised, and that decision remains static until you manually change it. This creates two failure modes.

Over-trust: You configure an agent as autonomous because it handled ten tasks well. On task eleven, it fails silently, and you don't find out until the damage compounds. I described this exact scenario in "Why Fully Autonomous Agents Don't Exist in Practice" — a content agent that ran unsupervised for three weeks before producing a blog post with fabricated statistics.

Under-trust: You keep an agent on a short leash because you don't have data to justify loosening it. Every task requires manual approval. The agent becomes an expensive autocomplete, and you spend more time approving actions than doing the work yourself. I tracked this pattern across my first month of production: 67% of my agent-related time was spent approving routine actions that had a 99.2% success rate.

Both failure modes have the same root cause: the absence of measurement. Without a quantified trust signal, you are guessing. Governance scoring eliminates the guessing.

Composite Quality Score (CQS)

The core of my scoring system is a Composite Quality Score calculated per agent, per task type, per time window. It is not a single metric — it is a weighted composition of four signals that together describe trustworthiness.

The Four Signals

1. Success Rate (weight: 0.35)

The simplest signal: what percentage of dispatches for this task type completed without errors or escalations? I measure this over a rolling 30-dispatch window, not a calendar period. Calendar periods create noise — an agent that runs 50 tasks in a week gets diluted differently than one that runs 5 tasks in a month.

```javascript
success_rate = completed_without_error / total_dispatches_in_window
```

A 30-dispatch window means the score reflects recent performance while smoothing out individual failures. One failure in 30 dispatches drops the success rate to 0.967 — noticeable but not catastrophic. Three failures drops it to 0.90 — a clear signal that something changed.
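
The rolling window can be sketched in a few lines. This is a minimal version, assuming each receipt exposes a `status` field; the field name is mine, not the production ledger schema.

```javascript
// Rolling success rate over the most recent N dispatches.
// `dispatches` is an array of receipt objects, oldest first;
// the `status` field name is illustrative, not the real schema.
function windowedSuccessRate(dispatches, windowSize = 30) {
  const recent = dispatches.slice(-windowSize);
  if (recent.length === 0) return null; // no data for this task type yet
  const ok = recent.filter(d => d.status === "completed").length;
  return ok / recent.length;
}
```

Returning `null` rather than 0 for an empty window matters: a new task type has no evidence either way, and treating "no data" as "0% success" would block graduation forever.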

2. Quality Gate Pass Rate (weight: 0.30)

Success is necessary but not sufficient. An agent can complete a task without errors while still producing mediocre output. Quality gates — automated checks that run after each dispatch — measure whether the output meets defined standards.

For SEOcrawler tasks, my quality gates check: test coverage above threshold, no linting violations, no type errors, commit message follows convention, no files modified outside the expected scope. For content tasks: readability score within range, no flagged phrases, internal links present, word count within bounds.

```javascript
gate_pass_rate = gates_passed / gates_evaluated
```

The distinction between success rate and gate pass rate matters. A refactoring task can succeed (no crashes, tests pass) but fail quality gates (introduced code duplication, missed a type annotation). Both signals contribute to trust, but they measure different things.

3. Escalation Frequency (weight: 0.20)

How often does the agent escalate to me? This signal is counterintuitive: escalation is good behavior. An agent that escalates when uncertain is more trustworthy than one that guesses. But an agent that escalates on everything is not autonomous — it is just delegating the work back to me.

The optimal escalation frequency depends on the task type. For dependency updates, I expect near-zero escalations. For content generation, I expect 15-25%. The scoring function normalizes against the expected rate:

```javascript
escalation_score = 1 - abs(actual_escalation_rate - expected_escalation_rate)
```

This means an agent is penalized both for escalating too much (timid) and too little (reckless). If the expected rate for content tasks is 0.20, an agent that escalates 0.50 of the time scores 0.70. An agent that never escalates scores 0.80. An agent that escalates exactly 0.20 of the time scores 1.00.
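
As executable code, the scoring function is tiny. The expected rates below are the examples from this section; the lookup map itself is an illustrative structure, not my production config.

```javascript
// Escalation score normalized against a per-task-type expected rate.
// Expected rates are the examples from the text; the fallback is a guess.
const EXPECTED_ESCALATION = { content: 0.20, dependency_update: 0.02 };

function escalationScore(actualRate, taskType) {
  const expected = EXPECTED_ESCALATION[taskType] ?? 0.10; // assumed default
  return 1 - Math.abs(actualRate - expected);
}
```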

4. Recovery Behavior (weight: 0.15)

When an agent fails, what happens next? Does the failure cascade, or does the agent recover gracefully? I measure this as the ratio of failures that required manual intervention versus failures the agent resolved through built-in retry logic or fallback strategies.

```javascript
recovery_score = self_recovered_failures / total_failures
```

If an agent has zero failures, this score defaults to 1.0 — no failures means no evidence of poor recovery. This default matters because it prevents the score from being undefined for new agents or agents with perfect records.
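
The zero-failure default is worth making explicit in code, because a naive division would produce `NaN` for a perfect record:

```javascript
// Recovery score with the zero-failure default described above.
function recoveryScore(selfRecovered, totalFailures) {
  if (totalFailures === 0) return 1.0; // no failures: no evidence of poor recovery
  return selfRecovered / totalFailures;
}
```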

Computing CQS

```javascript
CQS = (0.35 * success_rate) +
      (0.30 * gate_pass_rate) +
      (0.20 * escalation_score) +
      (0.15 * recovery_score)
```

The result is a number between 0.0 and 1.0. In practice, production agents rarely score below 0.60 (they would be decommissioned) or above 0.97 (statistical noise prevents perfect scores over sustained windows).
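
Putting the four signals together is a one-liner plus a weights table. The function and field names here are mine; only the weights come from the text.

```javascript
// The four signals combined with the weights from this section.
const WEIGHTS = { success: 0.35, gates: 0.30, escalation: 0.20, recovery: 0.15 };

function computeCQS({ successRate, gatePassRate, escalationScore, recoveryScore }) {
  return WEIGHTS.success    * successRate
       + WEIGHTS.gates      * gatePassRate
       + WEIGHTS.escalation * escalationScore
       + WEIGHTS.recovery   * recoveryScore;
}
```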

Here is what my production CQS scores look like across task types:

| Terminal | Task Type | CQS | Window |
|---|---|---|---|
| T1 | Content generation | 0.84 | Last 30 dispatches |
| T2 | Dependency updates | 0.96 | Last 30 dispatches |
| T2 | Test writing | 0.91 | Last 30 dispatches |
| T3 | Code refactoring | 0.88 | Last 30 dispatches |
| T4 | Monitoring/alerting | 0.93 | Last 30 dispatches |

These numbers tell me exactly where to focus. Content generation at 0.84 means the agent is competent but not ready for full autonomy. Dependency updates at 0.96 means the agent has earned fire-and-forget trust. I don't guess — I read the score.

Confidence Scoring Per Decision

CQS measures aggregate trustworthiness. But individual dispatches also need a confidence signal: how certain is the agent that its output is correct, right now, for this specific task?

I implemented a per-decision confidence score that the agent reports alongside every output. This is not the model's internal probability distribution — it is a structured self-assessment based on factors the agent can evaluate:

```json
{
  "dispatch": "CONTENT-312",
  "confidence": 0.72,
  "factors": {
    "similar_tasks_completed": 24,
    "template_match": true,
    "external_dependencies": 0,
    "scope_within_bounds": true,
    "novel_elements": ["new keyword cluster", "unfamiliar CMS field"]
  },
  "recommendation": "review_before_publish"
}
```

The confidence score drives real-time routing decisions. High confidence (above 0.85) plus high CQS (above 0.90) means the output routes directly to production. Low confidence (below 0.70) or low CQS (below 0.80) means the task stays manual. Everything in between lands in my review queue. The thresholds are not arbitrary — they were calibrated against six months of production data by comparing automated confidence predictions to my actual review decisions.
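
The routing logic, as I read the calibration table in the next section, fits in one small function. The function name and the string labels are mine.

```javascript
// Routing per the calibrated thresholds (values from the calibration table).
function routeDispatch(cqs, confidence) {
  if (cqs < 0.80 || confidence < 0.70) return "manual_only";
  if (cqs >= 0.90 && confidence >= 0.85) return "auto_approve";
  return "review_queue";
}
```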

Threshold Calibration

I calibrated thresholds by running a two-month experiment. Every dispatch routed through my review queue regardless of confidence or CQS. I recorded my decision (approve, reject, modify) for each dispatch and then computed the optimal thresholds that would have replicated my decisions with the highest accuracy.

The result:

| CQS Range | Confidence Range | Routing | My agreement rate |
|---|---|---|---|
| >= 0.90 | >= 0.85 | Auto-approve | 97.3% |
| >= 0.80 | >= 0.70 | Review queue | 89.1% |
| < 0.80 | Any | Manual only | 100% (by design) |
| Any | < 0.70 | Manual only | 100% (by design) |

The 97.3% agreement rate on auto-approved dispatches means that out of every 100 dispatches the system would auto-approve, I would have approved 97 of them. The 3 I would have caught are not catastrophic failures — they are style preferences and judgment calls that quality gates don't capture. For dependency updates and test writing, the agreement rate is 99.1%. For content tasks, it drops to 94.2% — which is why content tasks have stricter thresholds.
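
The calibration itself can be done by brute force: replay the recorded decisions under every candidate threshold pair and keep the pair with the best agreement. A minimal sketch, assuming each record carries the dispatch's CQS, confidence, and my logged decision (all field names are illustrative):

```javascript
// Brute-force calibration: find the threshold pair whose would-be
// auto-approvals best match the human decisions recorded during the
// review-everything experiment. Record shape is an assumption.
function calibrateThresholds(records, grid = [0.70, 0.75, 0.80, 0.85, 0.90]) {
  let best = { cqsMin: 1, confMin: 1, agreement: -1 };
  for (const cqsMin of grid) {
    for (const confMin of grid) {
      const auto = records.filter(r => r.cqs >= cqsMin && r.confidence >= confMin);
      if (auto.length === 0) continue;
      const agreement = auto.filter(r => r.decision === "approve").length / auto.length;
      if (agreement > best.agreement) best = { cqsMin, confMin, agreement };
    }
  }
  return best;
}
```

As written this maximizes agreement only; a production version would also penalize thresholds that auto-approve too few dispatches, otherwise the search can "win" by approving almost nothing.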

Graduation Criteria

CQS and confidence scores answer the question "how much do I trust this agent right now?" Graduation criteria answer a different question: "when should this agent's autonomy level change?"

I defined five autonomy levels, each with explicit entry and exit criteria:

Level 0: Supervised

Every action requires approval before execution. This is the default for new agents and new task types.

Entry criteria: Agent is new, or task type is new, or CQS dropped below 0.65.

Exit criteria (to Level 1): 15 consecutive dispatches completed with CQS above 0.75.

Level 1: Reviewed

Agent executes without pre-approval, but every output is reviewed before it reaches production.

Entry criteria: Graduated from Level 0, or CQS dropped below 0.80 from a higher level.

Exit criteria (to Level 2): 30 dispatches with CQS above 0.85 and zero critical failures.

Level 2: Spot-Checked

Agent outputs go directly to production. I review a random 25% sample within 24 hours.

Entry criteria: Graduated from Level 1.

Exit criteria (to Level 3): 50 dispatches with CQS above 0.90, confidence calibration error below 0.10, and maximum one non-critical failure.

Level 3: Audited

Agent operates fully autonomously. I review aggregate metrics weekly and individual dispatches only when flagged by quality gates.

Entry criteria: Graduated from Level 2.

Exit criteria (to Level 4): 100 dispatches with CQS above 0.93 and zero manual interventions required.

Level 4: Trusted

Agent operates with minimal oversight. Monthly metric review. Can modify its own quality gate thresholds within defined bounds.

Entry criteria: Graduated from Level 3.

No further graduation. Level 4 is the ceiling. No agent operates without any oversight. The monthly review is non-negotiable.
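
The exit criteria above reduce naturally to a small data table plus a check. This sketch keeps only the dispatch-count and CQS conditions; the zero-critical-failure and calibration conditions are omitted for brevity, and all names are mine.

```javascript
// Promotion criteria from the levels above (count and CQS conditions only).
const PROMOTION = [
  { from: 0, dispatches: 15,  minCQS: 0.75 },
  { from: 1, dispatches: 30,  minCQS: 0.85 },
  { from: 2, dispatches: 50,  minCQS: 0.90 },
  { from: 3, dispatches: 100, minCQS: 0.93 },
];

// `cqsHistory` holds per-dispatch CQS values at the current level, oldest first.
function eligibleForPromotion(level, cqsHistory) {
  const rule = PROMOTION.find(r => r.from === level);
  if (!rule) return false; // Level 4 is the ceiling
  if (cqsHistory.length < rule.dispatches) return false;
  return cqsHistory.slice(-rule.dispatches).every(c => c > rule.minCQS);
}
```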

Demotion Rules

Graduation is not permanent. Every level has demotion triggers:

  • Two critical failures in a 10-dispatch window: Drop two levels.
  • CQS below level entry threshold for 5 consecutive dispatches: Drop one level.
  • Confidence calibration error above 0.25: Drop one level and recalibrate thresholds.
  • Any security violation (scope breach, unauthorized data access): Drop to Level 0 immediately.

The demotion rules are asymmetric by design: promotion is slow (15-100 dispatches), demotion is fast (2-5 dispatches). This reflects a fundamental principle — it takes many successes to build trust and very few failures to lose it. The same is true for human teams, and it should be true for agent teams.
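
The triggers above, checked most severe first, look like this in code. The `stats` field names are illustrative summaries derived from recent receipts, not the real schema.

```javascript
// Demotion triggers from the list above, most severe first.
function demotionDelta(stats) {
  if (stats.securityViolation) return Infinity;    // straight back to Level 0
  if (stats.criticalFailuresLast10 >= 2) return 2;
  if (stats.calibrationError > 0.25) return 1;     // also triggers recalibration
  if (stats.dispatchesBelowEntryThreshold >= 5) return 1;
  return 0;
}

function applyDemotion(level, stats) {
  return Math.max(0, level - demotionDelta(stats));
}
```

Ordering matters: a security violation must short-circuit everything else, which is why it is the first check rather than one clause among several.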

Implementation Details

The scoring system runs as part of VNX Orchestration. The core implementation is straightforward:

Data source: Every dispatch writes a receipt to the NDJSON ledger. The receipt includes completion status, quality gate results, escalation events, and the agent's confidence score. The scoring system reads from this ledger.

Computation frequency: CQS recalculates after every dispatch. Graduation checks run after every CQS update. This means an agent can theoretically graduate or demote mid-session — and it has happened twice in production.

Storage: Current autonomy levels and CQS histories are stored in a JSON state file per terminal. The state file is version-controlled, which means graduation decisions are auditable through git history.

```json
{
  "terminal": "T2",
  "task_type": "dependency_update",
  "current_level": 3,
  "cqs_current": 0.96,
  "cqs_history": [0.93, 0.94, 0.95, 0.96, 0.96],
  "dispatches_at_current_level": 47,
  "last_demotion": null,
  "last_promotion": "2026-03-15T09:41:22Z",
  "confidence_calibration_error": 0.04
}
```

Alerting: When a CQS drops more than 0.05 in a single dispatch window, I get an alert through the VNX Intelligence System. The alert includes the dispatch that caused the drop, the specific signal that degraded, and the agent's confidence score for that dispatch. This is how I caught a model regression in March — CQS for content tasks dropped from 0.86 to 0.79 over three dispatches because the model started generating longer paragraphs that failed the readability gate.
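
The alert condition itself is a simple comparison over consecutive CQS values. A sketch, with an illustrative alert payload:

```javascript
// Flag a drop of more than `threshold` between consecutive CQS values.
// The 0.05 default is the threshold used above; the payload shape is mine.
function checkCQSDrop(cqsHistory, threshold = 0.05) {
  if (cqsHistory.length < 2) return null;
  const prev = cqsHistory[cqsHistory.length - 2];
  const curr = cqsHistory[cqsHistory.length - 1];
  const drop = prev - curr;
  return drop > threshold ? { prev, curr, drop } : null;
}
```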

What Six Months of Scoring Taught Me

Running this system since October 2025 produced several insights that I did not anticipate.

Insight 1: Task type matters more than agent capability. The same agent (same model, same configuration) scores 0.96 on dependency updates and 0.84 on content generation. The agent's capability didn't change — the task's predictability did. Governance scoring should be per-task-type, not per-agent.

Insight 2: Confidence calibration drifts. Early in production, agent confidence scores correlated well with actual quality (calibration error of 0.06). After three months, the error drifted to 0.14 as the model encountered new task variations that it hadn't seen during the calibration period. I now recalibrate every 200 dispatches.
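
Measuring that drift requires a concrete definition of calibration error. The post doesn't give the exact formula, so treat this as one plausible reading: the mean absolute gap between reported confidence and the observed outcome (1 if the dispatch passed its gates, 0 if not).

```javascript
// Calibration error as mean |confidence - outcome|. This formula is an
// assumption; substitute whatever definition your system actually uses.
function calibrationError(pairs) {
  if (pairs.length === 0) return 0;
  const total = pairs.reduce(
    (sum, p) => sum + Math.abs(p.confidence - (p.passed ? 1 : 0)), 0);
  return total / pairs.length;
}
```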

Insight 3: Demotion is healthy. My initial instinct was to treat demotions as failures. They are not. They are the system working correctly. Terminal T3 was demoted from Level 3 to Level 2 in January after a series of refactoring tasks produced merge conflicts. The demotion triggered a review that identified the root cause: a change in the repository's branch strategy that the agent wasn't configured to handle. After fixing the configuration, T3 re-graduated to Level 3 within two weeks. Without the demotion, the merge conflicts would have continued.

Insight 4: The 30-dispatch window is correct for most task types. I experimented with windows of 10, 20, 30, and 50. Ten-dispatch windows are too noisy — one bad dispatch swings the score dramatically. Fifty-dispatch windows are too slow — they take weeks to reflect real changes. Thirty dispatches balance responsiveness with stability. The exception is high-frequency tasks (monitoring alerts) where I use a 50-dispatch window because the dispatches arrive quickly enough that 50 is still recent data.

Insight 5: Level 4 agents exist. I was skeptical that any agent would reach Level 4 in production. Terminal T2 for dependency updates reached Level 4 in February after 100+ dispatches with a CQS of 0.95+. It has maintained Level 4 for 47 dispatches. The monthly reviews consistently confirm that its outputs are correct. This doesn't mean I trust it blindly — it means the scoring system has enough data to justify maximum autonomy for that specific task type.

Why This Matters Beyond My System

Governance scoring is not specific to VNX Orchestration. Any system that dispatches AI agents to execute tasks can implement this approach. The principles are transferable:

  1. Measure trust as a continuous signal, not a binary toggle.
  2. Score per task type, not per agent.
  3. Make graduation criteria explicit and auditable.
  4. Demote faster than you promote.
  5. Recalibrate regularly — confidence drift is real.

The alternative is what most teams do today: make trust decisions based on intuition, adjust autonomy levels manually when something goes wrong, and operate without any historical record of why an agent has the permissions it has. That approach works until it doesn't. And when it breaks, you have no data to diagnose what went wrong or how to prevent it.

Governance scoring gives you the data. It turns "I think this agent is trustworthy" into "this agent has a CQS of 0.91 over 30 dispatches with a confidence calibration error of 0.06." The first statement is an opinion. The second is a measurement. Production systems should run on measurements.


This is Part 14 of 14 in the Glass Box Governance series. The series began with the question "why does transparency matter for multi-agent AI?" and ends here with the operational answer: measure trust, quantify autonomy, and let agents earn independence through demonstrated reliability. The architecture is open source at github.com/Vinix24/vnx-orchestration.


Read also: VNX Intelligence System: Autonomous Monitoring That Actually Works -- how the monitoring layer feeds data into governance scoring decisions.

Read also: Human-on-the-Loop: A Production Graduation Model for AI Agents -- the theoretical model that governance scoring operationalizes.

Read also: Autonomous AI Agents Don't Exist (Yet) -- why full autonomy is a spectrum, not a destination.


Sources

  • VNX Orchestration -- production agent architecture with Glass Box Governance: github.com/Vinix24/vnx-orchestration
  • Anthropic, "Building effective agents" (2024) -- principles for agent reliability and evaluation: anthropic.com
  • Camunda, "2026 State of Agentic Orchestration and Automation" -- enterprise survey on governance gaps in agentic AI: camunda.com
  • NIST AI Risk Management Framework -- trust measurement and calibration in AI systems: nist.gov/artificial-intelligence

📚 Glass Box Governance series

  1. One Terminal to Rule Them All: How I Orchestrate Claude, Codex, and Gemini Without Them Knowing About Each Other
  2. Receipts, Not Chat Logs: What 2,472 AI Agent Dispatches Taught Me About Governance
  3. The Cascade of Doom: When AI Agents Hallucinate in Chains
  4. Why I Chose NDJSON Over Postgres for My AI Agent Audit Trail
  5. Claude Agent Teams vs. Building Your Own: What Anthropic Solved (And What They Left Out)
  6. Why Architecture Beats Models: Lessons from 2400+ AI Agent Dispatches
  7. The External Watcher Pattern: How I Observe AI Agents Without Trusting Their Self-Reports
  8. Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done
  9. From Human-in-the-Loop to Human-on-the-Loop: A Production Graduation Path
  10. Traceability as Architecture: Designing AI Systems Where Every Decision Has a Receipt
  11. Decision-Making Architecture: Why Autonomous Agents Need Governance, Not Just Instructions
  12. Context Rotation at Scale: How VNX Keeps AI Agents Honest After 10,000 Dispatches
  13. Autonomous Agent Patterns: 5 Production-Tested Approaches for Agents That Run Without You
  14. Governance Scoring: How to Measure Whether Your AI Agent Deserves More Autonomy ← you are here

Vincent van Deth

AI Strategy & Architecture

I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.

My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.

Based in the Netherlands. I write about what I build — including the failures.
