ProcBench evaluates LLM coding agents' process-level defects beyond final outcomes
New benchmark uncovers hidden failures in multi-step coding agent trajectories
ProcBench, introduced by Jiawei He and colleagues, shifts the evaluation of LLM coding agents from outcome-only metrics to process-level scrutiny. Existing benchmarks like SWE-bench measure final task completion, compilation success, or test pass rates, but ignore how the execution unfolds. ProcBench addresses this in three ways: it defines an ontology of process defects, converts heterogeneous execution logs into a standardized trajectory representation, and issues calibrated risk-based scorecards instead of binary pass/fail results. The team instantiated ProcBench on 200 annotated trajectories from three major coding-agent benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Their results show that process-aware scorecards significantly improve empirical interpretability compared to direct thresholding, and that ProcBench can reliably detect recurrent failures (such as loops, resource leaks, and permission escalation) that are invisible to outcome-based evaluations.
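To make the pipeline described above more concrete, here is a minimal Python sketch of the general idea: normalize log entries into uniform steps, flag one kind of process defect (the same action repeated back to back), and fold defect counts into a graded score rather than a pass/fail bit. Everything here is an assumption made for illustration: the `Step` schema, the `detect_repeated_action` and `risk_score` helpers, the repeat threshold, and the weights are hypothetical and are not taken from ProcBench. In particular, the squashing function below is not a calibrated score; real calibration would need annotated data.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Step:
    """One normalized step of an agent trajectory (hypothetical schema)."""
    index: int
    action: str   # e.g. "run_tests", "edit_file"
    target: str   # e.g. a file path or command argument


def detect_repeated_action(steps: List[Step], min_repeats: int = 3) -> List[int]:
    """Return the step index at which each run of identical consecutive
    (action, target) pairs begins, for runs of at least `min_repeats` steps."""
    hits: List[int] = []
    run_start = 0
    for i in range(1, len(steps) + 1):
        same = (
            i < len(steps)
            and (steps[i].action, steps[i].target)
            == (steps[run_start].action, steps[run_start].target)
        )
        if not same:
            # The run [run_start, i) just ended; record it if long enough.
            if i - run_start >= min_repeats:
                hits.append(steps[run_start].index)
            run_start = i
    return hits


def risk_score(defect_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Map weighted defect counts to a value in [0, 1).
    Illustrative only: the weights are assumptions and nothing is calibrated."""
    raw = sum(weights.get(name, 1.0) * count for name, count in defect_counts.items())
    return raw / (1.0 + raw)


trajectory = [
    Step(0, "run_tests", "."),
    Step(1, "run_tests", "."),
    Step(2, "run_tests", "."),
    Step(3, "edit_file", "a.py"),
]
loops = detect_repeated_action(trajectory)
score = risk_score({"repeated_action": len(loops)}, {"repeated_action": 1.0})
```

A trajectory like this one could still end with passing tests, which is why an outcome-only metric would not surface the three identical test runs at the start.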
The framework also introduces the concept of control preservation: the ability of an agent to maintain correct state and decision-making over multi-step tasks. By analyzing trajectories, ProcBench can pinpoint exactly where an agent loses control, enabling developers to debug not just what the agent produced but how it got there. The authors acknowledge limitations: annotation dependence, partial observability for certain defect classes (e.g., race conditions), and the need for broader external validation across more agent types and tasks. Despite these constraints, ProcBench offers a powerful lens for improving the reliability of LLM-driven coding agents, particularly in safety-critical and long-horizon programming tasks where a single mistake can cascade into major failures.
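One simple way to picture "pinpointing where an agent loses control" is to check an invariant at every step and report the first step that violates it. The sketch below assumes that reading; the summary does not say how ProcBench formalizes control preservation, so the `first_control_loss` helper, the state dictionaries, and the working-directory invariant are all hypothetical.

```python
from typing import Callable, Optional, Sequence, TypeVar

S = TypeVar("S")


def first_control_loss(
    states: Sequence[S], invariant: Callable[[S], bool]
) -> Optional[int]:
    """Return the position of the first state that violates `invariant`,
    or None if the invariant holds for the whole trajectory."""
    for i, state in enumerate(states):
        if not invariant(state):
            return i
    return None


# Hypothetical per-step snapshots: where the agent believes it is working
# versus where its commands actually ran.
snapshots = [
    {"believed_cwd": "/repo", "actual_cwd": "/repo"},
    {"believed_cwd": "/repo/src", "actual_cwd": "/repo/src"},
    {"believed_cwd": "/repo/src", "actual_cwd": "/tmp"},
    {"believed_cwd": "/repo/src", "actual_cwd": "/tmp"},
]

lost_at = first_control_loss(
    snapshots, lambda s: s["believed_cwd"] == s["actual_cwd"]
)
```

Reporting the first violating step, rather than only whether any violation occurred, is what lets a developer go straight to the point in the trajectory where things went wrong.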
- Introduces a reusable ontology of process-level defects for LLM coding agents, moving beyond binary pass/fail evaluation.
- Tested on 200 trajectories from three benchmarks (AndroidBench, TerminalBench, SWE-bench-Verified) with calibrated risk-based scorecards.
- Reveals recurrent failures like loops, resource leaks, and permission escalation that are invisible to outcome-based benchmarks.
Why It Matters
Better evaluation of coding agents' process reliability is crucial for safe deployment in production and long-running automated development tasks.