When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents
AI coding assistants like Claude Code and Codex lose substantial implementation fidelity when specifications emerge gradually instead of arriving upfront.
A new research paper titled 'When the Specification Emerges: Benchmarking Faithfulness Loss in Long-Horizon Coding Agents' reveals a critical weakness in current AI coding assistants. Researchers Lu Yan, Xuan Chen, and Xiangyu Zhang created SLUMP, a benchmark of faithfulness loss under emergent specification, to test how well agents like Claude Code and Codex handle real-world coding scenarios where requirements aren't fully known upfront but emerge through interaction, mirroring actual research and development workflows.
The benchmark pairs 20 recent ML papers from ICML 2025 and NeurIPS 2025, broken into 371 verifiable components, with interaction scripts that disclose design requirements gradually. When tested, both Claude Code and Codex performed significantly worse under these emergent-specification conditions than when given the full specification upfront. Single-shot specification controls achieved higher implementation fidelity on 16/20 papers for Claude Code and 14/20 for Codex, showing that current agents struggle to track durable design commitments across long coding sessions.
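The paper's script format isn't reproduced here, but a minimal Python sketch can make the setup concrete: requirements arrive turn by turn, and each turn adds verifiable components the agent must remain faithful to. Names such as InteractionTurn, InteractionScript, and revealed_components are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of an emergent-specification interaction script.
# The benchmark's real data format is not shown in the paper summary;
# these class and field names are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class InteractionTurn:
    """One user message that discloses part of the design."""
    message: str                                          # instruction given to the agent
    components: list[str] = field(default_factory=list)  # verifiable components it introduces

@dataclass
class InteractionScript:
    """A paper's requirements, revealed turn by turn instead of upfront."""
    paper_id: str
    turns: list[InteractionTurn]

    def revealed_components(self, upto: int) -> list[str]:
        """All components the agent is accountable for after `upto` turns."""
        out: list[str] = []
        for turn in self.turns[:upto]:
            out.extend(turn.components)
        return out

# Example: a two-turn script where the loss function is only pinned down later.
script = InteractionScript(
    paper_id="example-icml-2025",
    turns=[
        InteractionTurn("Implement the encoder described in Section 3.",
                        components=["encoder.layer_norm_placement"]),
        InteractionTurn("Actually, use the contrastive loss from Eq. 4, not MSE.",
                        components=["loss.contrastive_eq4"]),
    ],
)
assert script.revealed_components(2) == [
    "encoder.layer_norm_placement", "loss.contrastive_eq4"
]
```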
As a mitigation, the researchers introduced ProjectGuard, an external project-state layer that tracks specifications on the agent's behalf. On Claude Code, ProjectGuard recovered 90% of the faithfulness gap, increased fully faithful components from 118 to 181, and reduced severe failures from 72 to 49. This demonstrates that specification tracking is a distinct and crucial evaluation target for improving long-horizon coding agents, one that current benchmarks, which assume complete upfront knowledge, fail to capture.
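The paper doesn't publish ProjectGuard's interface, but the core idea of an external project-state layer can be sketched as a ledger of design commitments that lives outside the agent's context window and is re-injected into the prompt each turn. SpecLedger, record, and context_digest below are hypothetical names assumed for illustration, not ProjectGuard's actual API.

```python
# Hedged sketch of an external project-state layer in the spirit of
# ProjectGuard. All names here are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Commitment:
    key: str    # stable identifier, e.g. "loss.contrastive_eq4"
    text: str   # the requirement as last stated by the user
    turn: int   # when it was made or last revised

class SpecLedger:
    """Tracks durable design commitments outside the agent's context window."""

    def __init__(self) -> None:
        self._commitments: dict[str, Commitment] = {}

    def record(self, key: str, text: str, turn: int) -> None:
        # Later turns overwrite earlier ones, so a revision supersedes
        # the original requirement instead of coexisting with it.
        self._commitments[key] = Commitment(key, text, turn)

    def context_digest(self) -> str:
        """Compact summary re-injected into the agent's prompt every turn."""
        lines = [f"- [{c.key}] {c.text}" for c in
                 sorted(self._commitments.values(), key=lambda c: c.turn)]
        return "Active design commitments:\n" + "\n".join(lines)

ledger = SpecLedger()
ledger.record("loss", "Use MSE loss.", turn=1)
ledger.record("loss", "Use the contrastive loss from Eq. 4.", turn=5)
print(ledger.context_digest())
# Active design commitments:
# - [loss] Use the contrastive loss from Eq. 4.
```

Keying commitments by a stable identifier means a later revision supersedes the original requirement rather than accumulating beside it, which is exactly the kind of durable tracking the benchmark shows agents failing at.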
- SLUMP benchmark tests AI coding agents on 20 recent ML papers with 371 components where specs emerge gradually
- Claude Code and Codex showed significant faithfulness loss: single-shot full-spec controls scored higher on 16/20 and 14/20 papers respectively
- The ProjectGuard mitigation recovered 90% of the faithfulness gap on Claude Code, cutting severe failures from 72 to 49
Why It Matters
Reveals why AI coding assistants fail in real development workflows, where requirements evolve, and charts a path to closing this $2B+ market gap.