Developer Tools

Willful Disobedience: Automatically Detecting Failures in Agentic Traces

New AI-powered system analyzes 424 agentic traces to catch procedural errors that outcome-only benchmarks miss.

Deep Dive

Researchers from Microsoft Research and UC Berkeley have introduced AgentPex, a novel AI-powered tool designed to systematically detect failures in AI agent execution histories, known as agentic traces. As AI agents become embedded in real software systems to handle multi-step workflows through dialogue, tool calls, and decisions, their long execution histories make validation challenging. Traditional outcome-only benchmarks often miss critical procedural failures like incorrect workflow routing, unsafe tool usage, or violations of prompt-specified rules. AgentPex addresses this by automatically extracting behavioral rules from agent prompts and system instructions, then using these specifications to evaluate traces for compliance.
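The article does not include AgentPex's implementation, but the pipeline it describes — turn an agent's prompt into checkable behavioral rules, then evaluate each trace against them — can be sketched roughly as follows. All names, the rule format, and the hard-coded example rule are hypothetical illustrations, not AgentPex's actual API (in the real system, rule extraction is itself AI-powered):

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical sketch: a "rule" is a named predicate over a full trace.
@dataclass
class Rule:
    name: str
    check: Callable  # trace -> bool (True = compliant)

@dataclass
class Trace:
    domain: str                                      # e.g. "telecom", "retail"
    tool_calls: list = field(default_factory=list)   # (tool_name, args) pairs
    outcome_passed: bool = False                     # outcome-only verdict

def extract_rules_from_prompt(prompt: str) -> list[Rule]:
    """Stand-in for the AI step that turns prompt text into checkable rules.
    Here one rule is hard-coded for illustration: a refund must be preceded
    by an identity-verification tool call."""
    def refund_after_verification(trace: Trace) -> bool:
        verified = False
        for tool, _args in trace.tool_calls:
            if tool == "verify_identity":
                verified = True
            if tool == "issue_refund" and not verified:
                return False
        return True
    return [Rule("refund_requires_verification", refund_after_verification)]

def evaluate(trace: Trace, rules: list[Rule]) -> dict[str, bool]:
    """Check one trace against every extracted rule."""
    return {rule.name: rule.check(trace) for rule in rules}

# A trace that passes outcome-only scoring but violates procedure:
trace = Trace(
    domain="retail",
    tool_calls=[("issue_refund", {"order": 42}), ("verify_identity", {})],
    outcome_passed=True,
)
rules = extract_rules_from_prompt("Always verify identity before refunds.")
print(evaluate(trace, rules))  # {'refund_requires_verification': False}
```

The example illustrates exactly the gap the researchers highlight: the trace's outcome looks fine, but a prompt-specified procedural rule (verify before refunding) was violated, and only the rule-level check catches it.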

The team evaluated AgentPex on 424 agentic traces from the τ²-bench benchmark across customer service domains including telecom, retail, and airlines. Their results demonstrate that AgentPex effectively distinguishes the behavior of different AI models and surfaces specification violations that outcome-only scoring fails to capture. The tool provides fine-grained analysis by domain and metric, allowing developers to pinpoint agent strengths and weaknesses systematically. This moves AI agent testing beyond simple pass/fail metrics toward detailed procedural analysis that can improve agent reliability and safety in production environments.
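Fine-grained analysis of this kind typically amounts to aggregating per-rule verdicts by domain. A minimal sketch of that aggregation step — the input record shape and field names here are assumptions for illustration, not AgentPex's output format:

```python
from collections import defaultdict

def compliance_by_domain(results: list[dict]) -> dict:
    """Aggregate per-trace rule verdicts into per-domain compliance rates.

    Each result is assumed to look like:
      {"domain": "telecom", "verdicts": {"rule_name": True/False, ...}}
    """
    passed = defaultdict(lambda: defaultdict(int))
    total = defaultdict(lambda: defaultdict(int))
    for r in results:
        for rule, ok in r["verdicts"].items():
            total[r["domain"]][rule] += 1
            passed[r["domain"]][rule] += ok   # bool counts as 0/1
    return {
        dom: {rule: passed[dom][rule] / n for rule, n in rules.items()}
        for dom, rules in total.items()
    }

results = [
    {"domain": "telecom", "verdicts": {"no_unsafe_tool": True}},
    {"domain": "telecom", "verdicts": {"no_unsafe_tool": False}},
    {"domain": "retail",  "verdicts": {"no_unsafe_tool": True}},
]
print(compliance_by_domain(results))
# {'telecom': {'no_unsafe_tool': 0.5}, 'retail': {'no_unsafe_tool': 1.0}}
```

A per-domain, per-rule table like this is what lets a developer see, for example, that an agent reliably follows refund procedure in retail but routes telecom workflows incorrectly, rather than receiving a single aggregate score.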

Key Points
  • AgentPex analyzes 424 agentic traces from τ²-bench across telecom, retail, and airline domains
  • The tool extracts behavioral rules from prompts to detect procedural failures traditional benchmarks miss
  • Provides fine-grained analysis enabling developers to debug agent behavior at scale

Why It Matters

Enables systematic testing of AI agents in production systems, catching dangerous procedural errors before they cause real-world harm.