Every framework I evaluated had the same blind spot: they trusted the agent to report what it did.
"Here's what I executed," the agent would say. "Here's what happened. Here's what I'm doing next."
And most systems just... believed it. They built audit trails from self-reported logs. They made governance decisions based on what the agent claimed to have done. They committed to quality gates that depended entirely on the agent being honest about its own performance.
That's not observation. That's taking testimony from the defendant and calling it due process.
When I started building VNX—the governance layer behind our multi-agent AI orchestration—I realized that the moment my system became complex enough to need governance, I could no longer afford to trust an agent's self-report. Not because agents are dishonest, but because they're fundamentally limited in what they can see about their own behavior.
An AI agent executing code doesn't know if the code changed what it intended to change. It doesn't know if a subtle bug is cascading through downstream systems. It doesn't know if it made a decision based on a hallucination that won't surface for weeks.
It can report what it tried to do. But that's not the same as what it actually did.
The External Watcher Pattern is how I solved this.
The Broken Model: Self-Reporting Observability
Let me be concrete about what fails with agent self-reporting.
In a typical multi-agent system, Agent A completes a task and says: "Done. I created three files. I ran the tests. All passed. Here's my report."
Agent B trusts that report and builds on it: "Got it. You created three files. I'll integrate them now."
Then the integration breaks. Why? Because Agent A didn't know that one of the three files had a syntax error that only showed up when imported a certain way. Agent A ran tests on its own modules. It never tested the integration point.
Agent A's self-report was honest but incomplete. The system failed anyway.
This happens constantly in AI orchestration:
- An agent runs a test suite and reports success, but a performance regression isn't caught by the test harness
- An agent claims it validated input, but downstream the input causes an edge case failure
- An agent reports a dependency update as safe, but it created a subtle version mismatch
- An agent submits code for review and reports it "validated against requirements," but the actual files in git don't match what it claimed to validate
The pattern is always the same: the agent reports based on what it knows about, not what it actually affected.
The Dual-Input Bridge: How External Watching Works
The External Watcher Pattern breaks this cycle by observing agents through two independent channels:
- Input Channel 1: Agent Hooks (when available) — Structured callbacks from the agent execution environment that emit real-time signals about what the agent is doing
- Input Channel 2: Filesystem Watching (always works) — A neutral observer that watches what actually changed in the system, regardless of what the agent claims
These two channels feed into a single unified observation stream. When they align, you have confidence. When they diverge, you've found a critical gap.

Here's the actual architecture from VNX:
```
Agent Execution → Hook Events (if available)
          ↓
  Receipt Processor V4
          ↓
Dual-Input Validator Bridge
      ↙         ↘
Hook Reports      Filesystem Truth
(what the agent   (what actually
 claims)           happened)
      ↘         ↙
Unified Receipt (conflict-resolved)
          ↓
  VNX Governance Pipeline
```

The Receipt Processor V4 is the engine. It monitors two things simultaneously:
- Hook channel: Real-time JSON events from the agent's execution environment (if it supports hooks)
- Filesystem channel: File modifications, creations, deletions captured by a neutral watcher
Then it runs a simple but powerful conflict-detection algorithm: "Did the agent report it created a file? Does that file actually exist? Are the contents what the agent said they'd be?"
If yes to all three, the receipt passes. If any misalignment exists, the receipt is flagged and escalated to the next quality gate.
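In Python, that three-question check can be sketched in a few lines. The claim structure and the `sha256` field here are my illustration, not VNX's actual receipt schema:

```python
import hashlib
from pathlib import Path

def validate_claim(claim: dict) -> str:
    """Validate one 'created/modified file' claim against the filesystem.

    `claim` is an illustrative structure ({"path": ..., "sha256": ...}),
    not VNX's real schema. Returns "ALIGNED" or a divergence reason.
    """
    # Check 1: did the agent report a file at all?
    if not claim.get("path"):
        return "DIVERGENT: no file path in claim"
    path = Path(claim["path"])
    # Check 2: does that file actually exist?
    if not path.is_file():
        return f"DIVERGENT: {path} does not exist"
    # Check 3: are the contents what the agent said they'd be?
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if claim.get("sha256") and actual != claim["sha256"]:
        return f"DIVERGENT: content hash mismatch for {path}"
    return "ALIGNED"
```

Any non-`"ALIGNED"` result is exactly the misalignment that gets flagged and escalated.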
Real Implementation: receipt_processor_v4.sh
Let me show you the actual code that powers this in VNX.
The Receipt Processor runs continuously and monitors the unified reports directory:
```bash
#!/bin/bash
# receipt_processor_v4.sh
# Watches .vnx-data/unified_reports/*.md and generates receipts

REPORTS_DIR=".vnx-data/unified_reports"
RECEIPTS_DIR=".vnx-data/receipts"
HOOK_EVENTS=".vnx-data/hook_events"

while true; do
  for report in "$REPORTS_DIR"/*.md; do
    [[ -f "$report" ]] || continue  # skip the literal glob when no reports exist
    receipt="$RECEIPTS_DIR/$(basename "$report" .md).json"
    hook_file="$HOOK_EVENTS/$(basename "$report" .md).json"
    if [[ ! -f "$receipt" ]]; then
      # Parse the markdown report
      python3 report_parser.py "$report" > /tmp/parsed.json

      # If hook events exist for this task, compare them
      if [[ -f "$hook_file" ]]; then
        HOOK_DATA=$(cat "$hook_file")
        # Run dual-input validation
        jq -n \
          --slurpfile report /tmp/parsed.json \
          --argjson hooks "$HOOK_DATA" \
          '{
            timestamp: now | todate,
            task_id: $report[0].task_id,
            reported_changes: $report[0].files_modified,
            hook_signals: $hooks.events,
            validation: (
              if ($report[0].files_modified | length) ==
                 ($hooks.events | map(select(.type == "file_change")) | length)
              then "ALIGNED"
              else "DIVERGENT"
              end
            ),
            status: "PROCESSED"
          }' > "$receipt"
      else
        # No hooks available, trust filesystem observation
        jq -n \
          --slurpfile report /tmp/parsed.json \
          '{
            timestamp: now | todate,
            task_id: $report[0].task_id,
            validation_method: "FILESYSTEM_ONLY",
            status: "PROCESSED"
          }' > "$receipt"
      fi
    fi
  done
  sleep 5
done
```

The key insight: this doesn't need hook events to work. It degrades gracefully. If the agent environment doesn't provide hooks, the watcher relies entirely on filesystem truth. But when hooks are available, they provide that second channel of validation.
The report_parser.py script extracts the actual agent claims from markdown:
```python
#!/usr/bin/env python3
# report_parser.py - Extract structured data from markdown reports
import sys
import re
import json

def parse_report(markdown_content):
    """Extract task claims, file changes, validation steps"""
    report = {
        "task_id": None,
        "files_modified": [],
        "tests_run": [],
        "validation_steps": [],
        "claimed_status": None
    }

    # Find task ID
    match = re.search(r'Task ID: (T[0-3]-\d+)', markdown_content)
    if match:
        report["task_id"] = match.group(1)

    # Find all "modified X" claims
    for match in re.finditer(r'Modified: `([^`]+)`', markdown_content):
        report["files_modified"].append(match.group(1))

    # Find test execution claims
    for match in re.finditer(r'Test: ([^\n]+) → (PASS|FAIL)', markdown_content):
        report["tests_run"].append({
            "test": match.group(1),
            "result": match.group(2)
        })

    # Extract status claim
    match = re.search(r'Status: (COMPLETE|FAILED|ESCALATED)', markdown_content)
    if match:
        report["claimed_status"] = match.group(1)

    return report

if __name__ == "__main__":
    with open(sys.argv[1], 'r') as f:
        markdown = f.read()
    parsed = parse_report(markdown)
    # Emit a single JSON object; the receipt processor reads it with
    # jq --slurpfile, which wraps the file's contents in an array itself.
    print(json.dumps(parsed))
```

This is deliberately simple. It's not trying to be intelligent. It's extracting claims from the agent's own words, then comparing those claims against what actually happened on the filesystem.
Multi-Provider Dispatch: Provider-Neutral Observation
One of the design requirements for VNX was supporting multiple AI providers without special cases. A task might be dispatched to:
- Claude Code via `/skill-name`
- Claude via direct API
- Codex CLI with `$skill-name`
- Gemini CLI with `@skill-name`
The External Watcher doesn't care. It doesn't need to integrate with each provider's specific logging. It just watches the filesystem.

When a dispatch is created, VNX records:
```json
{
  "dispatch_id": "D-2026-0305-001",
  "created_at": "2026-03-05T10:30:00Z",
  "assigned_to": "T1",
  "provider": "claude_code",
  "skill": "refactor_component",
  "baseline_files_hash": "a3f2e91...",
  "filesystem_snapshot": {
    "src/components/": [list of files and hashes],
    "src/lib/": [list of files and hashes]
  }
}
```

The watcher then monitors those exact files. When the provider's agent completes execution, it generates a unified report (markdown). The Receipt Processor reads that report, extracts claims, and compares:
- Files the agent claims to have modified → Files that actually changed in the filesystem
- Tests the agent claims to have run → Existence of test artifacts, test logs
- Dependencies the agent says it verified → Lock files, package versions
- Performance targets the agent says it met → Benchmark logs (if they exist)
The provider-neutral part is critical: I don't need to parse Claude's thinking blocks, or Codex's execution logs, or Gemini's intermediate outputs. Those are implementation details. What matters is the filesystem truth—and that's universal across all providers.
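Filesystem truth here reduces to two hash maps: one snapshot taken at dispatch time (like the `filesystem_snapshot` in the dispatch record above) and one at completion. A minimal sketch of that comparison, with function names that are mine rather than VNX's:

```python
import hashlib
from pathlib import Path

def snapshot(root: str) -> dict:
    """Hash every file under `root` -> {relative_path: sha256}."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in base.rglob("*") if p.is_file()
    }

def actual_changes(before: dict, after: dict) -> set:
    """Files created, modified, or deleted between two snapshots."""
    return {
        path for path in before.keys() | after.keys()
        if before.get(path) != after.get(path)
    }

def divergence(claimed: set, before: dict, after: dict) -> dict:
    """Compare the agent's claimed file list against filesystem truth."""
    changed = actual_changes(before, after)
    return {
        "unclaimed_changes": sorted(changed - claimed),   # happened, not reported
        "unverified_claims": sorted(claimed - changed),   # reported, didn't happen
    }
```

Either non-empty list is a divergence, and neither one requires knowing anything about the provider that did the work.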
What the Watcher Catches That Self-Reporting Misses
Let me give you concrete examples from running VNX in production.
Example 1: The Silent Dependency Regression
Agent claim from unified report: "Updated next.config.mjs and validated against package.json. All dependencies aligned."
Filesystem watcher found: Next.js updated from 15.0.3 to 15.1.0, but the lockfile wasn't regenerated. The claim was honest: the agent did validate the config. But it didn't know the package manager was configured to allow minor version bumps, so the dev environment would install 15.1.0 while CI would install 15.0.3.
The dual-input bridge flagged this: "Hook events show dependency check passed, but filesystem shows version mismatch in lockfile timestamp vs package.json modification time." This goes to the quality gate instead of being automatically approved.
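The check behind that flag is simple once you frame it as file metadata: if the manifest was modified after the lockfile was last written, the lockfile is stale. A sketch, using npm's file names as an assumed example (any manifest/lockfile pair works the same way):

```python
from pathlib import Path

def lockfile_stale(manifest: str = "package.json",
                   lockfile: str = "package-lock.json") -> bool:
    """True if the manifest changed after the lockfile was regenerated.

    Default file names are an assumption for illustration; the comparison
    itself is just two mtimes.
    """
    m, l = Path(manifest), Path(lockfile)
    if not l.exists():
        return True  # a missing lockfile is its own divergence
    return m.stat().st_mtime > l.stat().st_mtime
```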
Example 2: The Phantom Test Pass
Agent claim: "Created test suite. All 12 tests passed."
Filesystem watcher found: Test file was created. Test artifact file existed. But the artifact was from yesterday's run, not today's. The agent ran the test command, saw exit code 0 from a cached result, reported success, and moved on.
The watcher compared the test artifact's mtime with the task start time. Divergence detected. Escalated.
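That freshness check needs nothing more than an mtime comparison. A sketch (the function name and signature are mine):

```python
from pathlib import Path

def artifact_is_fresh(artifact_path: str, task_start: float) -> bool:
    """A test artifact only counts as evidence if it was written after
    the task started. Catches the phantom pass: exit code 0 from a
    cached result whose file predates the run."""
    p = Path(artifact_path)
    return p.exists() and p.stat().st_mtime >= task_start
```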
Example 3: The Partial File Modification
Agent claim: "Modified src/components/Header.tsx with the changes requested."
Filesystem watcher found: File was modified (✓). But over 95% of the file was identical to the version before. The agent made surgical changes correctly, but the claim was "modified the component," which humans tend to read as "rewrote it." The neutral watcher simply reported: "Header.tsx changed by 4.2% of line count."
The next agent reviewing this code saw the honest statistic and knew exactly how much of the file was actually changed.
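That line-count statistic is easy to compute neutrally with Python's standard `difflib`. A sketch of one way to get such a figure (not necessarily how VNX computes its number):

```python
import difflib

def percent_changed(before: str, after: str) -> float:
    """Share of the new file's lines that differ from the baseline,
    as a percentage -- the kind of neutral statistic behind
    'Header.tsx changed by 4.2% of line count'."""
    old, new = before.splitlines(), after.splitlines()
    matcher = difflib.SequenceMatcher(None, old, new)
    # Lines that survived unchanged from the old version
    unchanged = sum(block.size for block in matcher.get_matching_blocks())
    total = max(len(new), 1)
    return round(100 * (1 - unchanged / total), 1)
```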
These are the gaps where self-reporting is either blind or prone to interpretation mismatch. The External Watcher doesn't judge these situations—it just makes them visible.
Integration with the Broader VNX Pipeline
The Receipt Processor's output feeds directly into the quality gates. The VNX Supervisor monitors all processes:
```bash
# vnx_supervisor_simple.sh watches all receipts and escalates divergences
while true; do
  for receipt in .vnx-data/receipts/*.json; do
    [[ -f "$receipt" ]] || continue  # skip the literal glob when no receipts exist
    if jq -e '.validation == "DIVERGENT"' "$receipt" > /dev/null; then
      # Divergence detected - escalate to quality gate
      TASK_ID=$(jq -r '.task_id' "$receipt")
      echo "ESCALATION: $TASK_ID - Hook events and filesystem diverged" \
        >> .vnx-data/quality-gates/escalations.log
      # Block automatic approval
      echo "BLOCKED" > ".vnx-data/tasks/$TASK_ID/approval_status"
    fi
  done
  sleep 5
done
```

The T0 Orchestrator reviews these escalations. It doesn't just see "approval blocked." It sees the actual divergence data, can pull the agent's full report, can inspect the filesystem diff, and makes an informed decision.
Honest Limitations
This approach isn't perfect, and I want to be clear about what it can't do.
It can't detect logical errors. If an agent modifies a file and the file syntax is correct but the logic is wrong, the watcher will report "file modified successfully." The logical error requires code review or testing at the next gate.
It depends on filesystem coherence. If the underlying storage is flaky or if operations happen too quickly (race conditions), the watcher might miss changes. In practice, this is rare, but it's a theoretical limitation.
It requires stable task isolation. If Agent A and Agent B are modifying the same files simultaneously, the watcher can't reliably attribute changes. This is why VNX enforces task-level file locking.
Hook events add latency. If available, they're useful, but they're not free. They add network I/O and processing overhead. The system gracefully degrades to filesystem-only if hooks become too slow.
False positives are possible but rare. A file might be modified outside the agent's execution (manual edit, concurrent process). The watcher will flag this, and the quality gate team investigates. This is a feature, not a bug—it makes you aware of assumptions your system was making.
The Philosophical Core
The External Watcher Pattern is built on a simple principle: external observation beats self-reporting at scale.
When you have one AI agent handling a simple task, self-reporting works fine. When you have four parallel agents, each spawning sub-agents, each modifying shared infrastructure, each making claims about what they did—self-reporting becomes a liability.
The watcher doesn't care if the agent is trustworthy. It cares about ground truth. It watches the filesystem, compares it against claims, and makes divergences visible.
In the next part of this series, I'll cover how these divergences feed into the async quality gates—the decision points where humans and AI work together to determine whether an agent's work gets approved or escalated.
For now, the key takeaway is this: if you're building multi-agent AI systems, you need to observe them from the outside. Don't trust the agent to audit itself. Build a watcher.
The full VNX orchestration system — including the external watcher, receipt processor, and dual-input bridge — is open source on GitHub.
📖 Read also: From Human-in-the-Loop to Human-on-the-Loop — How the external watcher enables graduated agent autonomy
📖 Read also: Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done — The decision points where humans and AI work together to determine whether an agent's work gets approved
This is Part 6 of the Glass Box Governance series.
📚 Glass Box Governance series
- One Terminal to Rule Them All: How I Orchestrate Claude, Codex, and Gemini Without Them Knowing About Each Other
- Receipts, Not Chat Logs: What 2,472 AI Agent Dispatches Taught Me About Governance
- The Cascade of Doom: When AI Agents Hallucinate in Chains
- Why I Chose NDJSON Over Postgres for My AI Agent Audit Trail
- Claude Agent Teams vs. Building Your Own: What Anthropic Solved (And What They Left Out)
- The External Watcher Pattern: How I Observe AI Agents Without Trusting Their Self-Reports ← you are here
- Why Architecture Beats Models: Lessons from 2400+ AI Agent Dispatches
- Async Quality Gates: Why AI Agents Don't Get to Decide When They're Done
- From Human-in-the-Loop to Human-on-the-Loop: A Production Graduation Path
- Traceability as Architecture: Designing AI Systems Where Every Decision Has a Receipt
- Decision-Making Architecture: Why Autonomous Agents Need Governance, Not Just Instructions
- Context Rotation at Scale: How VNX Keeps AI Agents Honest After 10,000 Dispatches
- Autonomous Agent Patterns: 5 Production-Tested Approaches for Agents That Run Without You
- Governance Scoring: How to Measure Whether Your AI Agent Deserves More Autonomy
Vincent van Deth
AI Strategy & Architecture
I build production systems with AI — and I've spent the last six months figuring out what it actually takes to run them safely at scale.
My focus is AI Strategy & Architecture: designing multi-agent workflows, building governance infrastructure, and helping organisations move from AI experiments to auditable, production-grade systems. I'm the creator of VNX, an open-source governance layer for multi-agent AI that enforces human approval gates, append-only audit trails, and evidence-based task closure.
Based in the Netherlands. I write about what I build — including the failures.