Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI
New paper identifies critical 'governance-to-action gap' in current AI agent safety approaches.
Researchers Christopher Koch and Joshua Andreas Wellbrock have published an analysis titled 'Beyond Task Success: An Evidence-Synthesis Framework for Evaluating, Governing, and Orchestrating Agentic AI.' The paper synthesizes findings from 24 recent sources and identifies a critical flaw in how we manage AI agents (systems such as AutoGPT or Devin that plan, use tools, and execute multi-step workflows). The core problem is a 'governance-to-action closure gap': current evaluation tells us whether an outcome was good, and governance defines what should be allowed, but neither specifies how to bind rules to concrete actions or how to prove compliance afterward. This leaves a dangerous void for agents whose actions have real-world effects.
The authors propose a unified framework to close this gap. It consists of three key artifacts: a four-layer model spanning evaluation, governance, orchestration, and assurance; an 'ODTA' runtime test based on Observability, Decidability, Timeliness, and Attestability to decide where to place governance checks; and a 'minimum action-evidence bundle' to log proof for state-changing actions. Their analysis of the literature shows that evaluation research identifies safety and robustness as open issues, governance frameworks lack execution logic, and orchestration studies point to the control plane as the critical point for policy enforcement. Crucially, they note that text alignment in models like GPT-4 does not reliably transfer to safe tool use.
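To make these runtime artifacts concrete, the sketch below shows one way an ODTA placement test and a minimum action-evidence bundle could be expressed in code. It is a minimal illustration, not the paper's implementation; all names (OdtaCheck, ActionEvidenceBundle, placeable, attest) and field choices are assumptions made for this example.

```python
import hashlib
import json
from dataclasses import dataclass, asdict

@dataclass
class OdtaCheck:
    """Illustrative ODTA test: is this point in the agent's workflow a
    viable place for a governance check? (All fields are assumptions.)"""
    observable: bool   # can the intended action be seen before it executes?
    decidable: bool    # can policy reach an allow/deny verdict with the context at hand?
    timely: bool       # can the verdict arrive before the action takes effect?
    attestable: bool   # can the verdict and action be recorded as verifiable evidence?

    def placeable(self) -> bool:
        # A checkpoint is viable only if all four properties hold.
        return self.observable and self.decidable and self.timely and self.attestable

@dataclass
class ActionEvidenceBundle:
    """Illustrative minimum action-evidence bundle for a state-changing action."""
    agent_id: str
    action: str        # e.g. the tool call and its parameters
    policy_id: str     # which rule was evaluated
    decision: str      # "allow" or "deny"
    timestamp: float

    def attest(self) -> str:
        # Hash the bundle so the decision and action can be proven after the fact.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()
```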
A practical scenario involving an enterprise procurement agent illustrates the framework's application. It demonstrates how these artifacts consolidate existing evidence to provide a coherent path for governing complex, autonomous AI systems, moving beyond simplistic benchmarks to ensure trustworthy deployment where it matters most—at the moment of action.
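Continuing the hypothetical sketch above, the procurement scenario might look like this at the moment of action (the vendor, amount, and policy ID are invented for illustration):

```python
import time

# A checkpoint in the agent's control plane before a purchase order is created.
check = OdtaCheck(observable=True, decidable=True, timely=True, attestable=True)

if check.placeable():
    bundle = ActionEvidenceBundle(
        agent_id="procurement-agent-01",
        action="create_purchase_order(vendor='ACME', amount=4200)",
        policy_id="spend-limit-5000",
        decision="allow",
        timestamp=time.time(),
    )
    print(bundle.attest())  # attestation hash, logged for post-hoc compliance audits
```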
- Identifies a 'governance-to-action gap' where rules aren't linked to executable controls in AI agents.
- Proposes a four-layer framework and an 'ODTA' test (Observability, Decidability, Timeliness, Attestability) for runtime governance.
- Highlights that text alignment (e.g., in GPT-4) does not ensure safe tool use, necessitating new evidence and control methods.
Why It Matters
Provides a critical roadmap for safely deploying autonomous AI agents in enterprise workflows where actions have real consequences.