Developer Tools

TRAJEVAL: Decomposing Code Agent Trajectories for Fine-Grained Diagnosis

New research reveals that AI code agents waste significant effort, examining 22x more functions than necessary to fix bugs.

Deep Dive

A research team from Columbia University, the University of Virginia, and ServiceNow has published a new paper introducing TRAJEVAL, a framework designed to solve a critical blind spot in AI agent evaluation. Currently, when an autonomous code agent fails to resolve a GitHub issue, evaluation is limited to a simple pass/fail metric such as Pass@1, which offers no insight into where in the process (searching for files, reading code, or making edits) the agent actually went wrong. TRAJEVAL addresses this by decomposing an agent's entire execution trajectory into those three interpretable stages and computing precision and recall for each stage against a reference patch.
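
In spirit, each stage's score reduces to set overlap between what the agent touched and what the reference patch implicates. The Python sketch below is a minimal illustration of that idea; the function name, the set-based item extraction, and the example values are assumptions for exposition, not TRAJEVAL's actual API.

    # Minimal sketch of per-stage precision/recall, assuming each stage
    # yields a set of items the agent touched (files searched, functions
    # read, locations edited) and a reference set derived from the gold
    # patch. Names and extraction details are illustrative.

    def stage_precision_recall(agent_items: set[str],
                               reference_items: set[str]) -> tuple[float, float]:
        """Precision: share of touched items that were relevant.
        Recall: share of relevant items that were touched."""
        overlap = agent_items & reference_items
        precision = len(overlap) / len(agent_items) if agent_items else 0.0
        recall = len(overlap) / len(reference_items) if reference_items else 0.0
        return precision, recall

    # Example for the read stage: the agent reads 23 functions while the
    # reference patch touches only one (hypothetical names).
    agent_reads = {f"helper_{i}" for i in range(22)} | {"parse_config"}
    gold_functions = {"parse_config"}
    p, r = stage_precision_recall(agent_reads, gold_functions)
    print(f"read precision={p:.3f}, recall={r:.3f}")  # 0.043, 1.000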

In an analysis of 16,758 trajectories spanning three agent architectures and seven models, TRAJEVAL uncovered both universal problems and model-specific weaknesses. All agents examined approximately 22 times more functions than necessary, a striking inefficiency, but their failure modes diverged: GPT-5 reliably located relevant code yet often targeted its edits incorrectly, while Qwen-32B frequently failed outright at the initial file-discovery stage.

The power of TRAJEVAL lies in its dual utility for diagnosis and improvement. The framework proved highly predictive, estimating model-level Pass@1 with a mean absolute error of just 0.87-2.1%. More importantly, it is actionable: by feeding the trajectory signals it identifies back to agents in real time, researchers improved two state-of-the-art models by 2.2 to 4.6 percentage points while reducing inference costs by 20-31%. This moves AI agent evaluation from simplistic outcome-based benchmarking toward a mechanism-driven science of failure, enabling targeted optimizations.
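
To make the actionable piece concrete, one could imagine a feedback hook like the hypothetical sketch below, which turns a low read-stage precision signal into a corrective hint for the agent mid-run. The threshold, wording, and integration point are assumptions; the paper's actual feedback mechanism may differ.

    # Hypothetical mid-run feedback hook built on the same stage signals;
    # the precision threshold and hint text are illustrative assumptions.
    # `likely_relevant` would come from some relevance estimator at run
    # time, since the gold patch is unavailable during inference.

    def feedback_message(functions_read: set[str],
                         likely_relevant: set[str],
                         precision_floor: float = 0.2) -> str | None:
        """Return a corrective hint when read-stage precision gets too low."""
        if not functions_read:
            return None
        precision = len(functions_read & likely_relevant) / len(functions_read)
        if precision < precision_floor:
            return ("Many of the functions read so far appear unrelated to "
                    "the issue; narrow the search before editing.")
        return None  # precision acceptable; let the agent proceed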

Key Points
  • Diagnoses agent failure in 3 stages: search (file localization), read (function comprehension), and edit (modification targeting).
  • Found that agents are universally inefficient, examining 22x more functions than needed to fix a bug.
  • Actionable feedback improved two top models by up to 4.6 percentage points while reducing inference costs by 20-31%.

Why It Matters

Provides a scientific method to debug and optimize expensive AI coding agents, directly improving their success rate and cost-efficiency for developers.