T2J-Bench reveals even the best AI coding agents fail about 72% of codebase conversions
The best system passes at most 28.9% of conversions overall, despite clearing 91.1% of Spec-stage checks.
A new paper from Microsoft researchers and colleagues introduces T2J-Bench, a rigorous benchmark designed to evaluate how well AI coding agents handle full codebase conversion. The benchmark reformulates conversion as a transfer task under a fixed equivalence contract, then compares source and converted codebases through three ordered stages: Spec (checking interface admissibility), Numeric (comparing forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (observing short training dynamics under fixed seeds). This layered approach exposes a critical weakness in current AI agents: they often pass shallow validation but fail on deeper semantic guarantees.
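The ordered-stage idea can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration, not the benchmark's actual harness: the `first_failed_stage` helper, the dictionary keys (`signature`, `outputs`, `loss_curve`), the tolerances, and the sample values are all hypothetical. It shows only the gating logic, where a conversion must clear Spec before Numeric is consulted, and Numeric before Behavioral.

```python
import math
from typing import Optional, Sequence


def close(a: Sequence[float], b: Sequence[float], rel_tol: float) -> bool:
    """True when both sequences have equal length and match element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol, abs_tol=1e-8) for x, y in zip(a, b)
    )


def first_failed_stage(source: dict, converted: dict) -> Optional[str]:
    """Run the stages in order; return the first one that fails, else None."""
    checks = [
        # Spec: the converted interface must be admissible (here: identical).
        ("spec", lambda: source["signature"] == converted["signature"]),
        # Numeric: forward outputs must agree within a tight tolerance.
        ("numeric", lambda: close(source["outputs"], converted["outputs"], 1e-5)),
        # Behavioral: short training curves under a fixed seed must track.
        ("behavioral", lambda: close(source["loss_curve"], converted["loss_curve"], 1e-2)),
    ]
    for name, check in checks:
        if not check():
            return name
    return None


# Hypothetical artifacts recorded from a source codebase.
source = {
    "signature": ("x: f32[8,16]", "-> f32[8,4]"),
    "outputs": [0.12, -0.53, 1.07],
    "loss_curve": [2.30, 1.90, 1.60],
}
# A conversion that keeps the interface but drifts numerically.
shallow = dict(source, outputs=[0.12, -0.48, 1.07])
# A conversion that matches on every recorded artifact.
faithful = dict(source)

print(first_failed_stage(source, shallow))   # numeric
print(first_failed_stage(source, faithful))  # None
```

The `shallow` case mirrors the failure pattern described above: it would pass an interface-only check yet fails as soon as outputs are compared.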
Across 355 blind conversion attempts, the best system achieved only a 26.7–28.9% overall pass rate, despite reaching 91.1% on the Spec stage alone. Notably, a 4.7x spread in token budget across systems corresponded to only a 2.2x spread in pass rate, and every system overestimated its own success by 66.6–97.8 points relative to the fixed evaluator. This suggests the core problem is not model capacity or compute budget, but rather agents' tendency to trust their own incomplete validation routines. The findings have direct implications for deploying coding agents in production environments where semantic equivalence is critical.
- The best system achieved only a 26.7–28.9% overall pass rate on T2J-Bench despite a 91.1% Spec-stage pass rate
- A 4.7x spread in token budget across systems corresponded to only a 2.2x spread in pass rate
- All tested systems overestimated their success by 66.6 to 97.8 points compared to the fixed evaluator
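The overestimation figure can be read as a gap between what an agent claims and what the fixed evaluator confirms. The sketch below assumes that "points" means percentage points and that the gap is the self-reported success rate minus the evaluator's pass rate; both are assumptions, as are the function name `overestimate_points` and the sample counts, which are made up and not taken from the paper.

```python
def overestimate_points(self_reported: int, evaluator_confirmed: int, attempts: int) -> float:
    """Gap, in percentage points, between claimed and confirmed pass rates."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    claimed_rate = 100 * self_reported / attempts
    confirmed_rate = 100 * evaluator_confirmed / attempts
    return claimed_rate - confirmed_rate


# Hypothetical run: the agent reports 48 of 50 conversions as successful,
# while the fixed evaluator confirms only 14 of them.
gap = overestimate_points(self_reported=48, evaluator_confirmed=14, attempts=50)
print(round(gap, 1))  # 68.0
```

A gap near zero would mean the agent's own validation agrees with the evaluator; a large positive gap means the agent's checks are weaker than the contract it is being held to.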
Why It Matters
The results expose coding agents' overconfidence in their own conversions and point to the need for validation against a fixed equivalence contract rather than shallow checks.