Developer Tools

T2J-Bench reveals AI coding agents fail 72% of codebase conversions

arXiv cs.SE May 29, 2026

⚡ The best systems pass only about 28% of conversions overall, despite passing 91% of surface-level Spec checks.

Deep Dive

A new paper from Microsoft researchers and colleagues introduces T2J-Bench, a rigorous benchmark designed to evaluate how well AI coding agents handle full codebase conversion. The benchmark reformulates conversion as a transfer task under a fixed equivalence contract, then compares source and converted codebases through three ordered stages: Spec (checking interface admissibility), Numeric (comparing forward outputs, losses, gradients, and objective-specific tensors), and Behavioral (observing short training dynamics under fixed seeds). This layered approach exposes a critical weakness in current AI agents: they often pass shallow validation but fail on deeper semantic guarantees.
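The ordered, gated structure of the three stages can be illustrated with a small sketch. This is not the benchmark's actual harness: the function names, the dictionary-of-arrays representation, and the tolerances below are all assumptions chosen for illustration. It only shows how a conversion can clear an interface check and still fail a deeper stage.

```python
import numpy as np

# Hypothetical sketch of an ordered three-stage equivalence check.
# Names, data layout, and tolerances are illustrative assumptions,
# not the T2J-Bench implementation.

def spec_stage(src_out, conv_out):
    # Interface check: same output names and same tensor shapes.
    if src_out.keys() != conv_out.keys():
        return False
    return all(np.shape(src_out[k]) == np.shape(conv_out[k]) for k in src_out)

def numeric_stage(src_out, conv_out, rtol=1e-5, atol=1e-6):
    # Value check: outputs, losses, gradients agree within tolerance.
    return all(
        np.allclose(src_out[k], conv_out[k], rtol=rtol, atol=atol)
        for k in src_out
    )

def behavioral_stage(src_losses, conv_losses, rtol=1e-3):
    # Dynamics check: short fixed-seed training curves track each other.
    return len(src_losses) == len(conv_losses) and bool(
        np.allclose(src_losses, conv_losses, rtol=rtol)
    )

def evaluate(src_out, conv_out, src_losses, conv_losses):
    """Return the first failed stage name, or 'pass' if all stages hold."""
    if not spec_stage(src_out, conv_out):
        return "spec"
    if not numeric_stage(src_out, conv_out):
        return "numeric"
    if not behavioral_stage(src_losses, conv_losses):
        return "behavioral"
    return "pass"

if __name__ == "__main__":
    src = {"logits": np.ones((2, 3)), "loss": np.array(0.5)}
    # Same shapes, slightly different values: passes Spec, fails Numeric.
    drifted = {"logits": np.ones((2, 3)) * 1.1, "loss": np.array(0.5)}
    print(evaluate(src, drifted, [1.0, 0.9], [1.0, 0.9]))
```

Because each stage runs only if the previous one passed, a high Spec pass rate says nothing about how many conversions survive the Numeric and Behavioral stages.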

Across 355 blind conversion attempts, the best systems achieved only a 26.7–28.9% overall pass rate, despite reaching 91.1% on the Spec stage alone. Notably, a 4.7x spread in token budget corresponded to only a 2.2x spread in pass rate, and every system overestimated its own success by 66.6–97.8 points relative to the fixed evaluator. This suggests the core problem is not model capacity or compute budget but agents' tendency to trust their own incomplete validation routines. The findings bear directly on deploying coding agents in production environments where semantic equivalence is critical.
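The two headline comparisons are simple arithmetic, sketched here under stated assumptions. The overestimation gap is assumed to be the self-reported success rate minus the evaluator's pass rate, in points; the summary does not spell out the paper's exact definition. The self-reported value in the example is made up for illustration and is not a result from the paper.

```python
# Illustrative arithmetic only; function names and the gap definition
# are assumptions, not taken from the paper.

def overestimation_gap(self_reported_pct, evaluator_pct):
    # Points by which a system's own success claim exceeds the evaluator's.
    return self_reported_pct - evaluator_pct

def scaling_efficiency(pass_rate_spread, token_budget_spread):
    # Pass-rate spread obtained per unit of token-budget spread.
    return pass_rate_spread / token_budget_spread

if __name__ == "__main__":
    # Reported spreads: 2.2x pass rate across a 4.7x token budget.
    print(round(scaling_efficiency(2.2, 4.7), 2))  # 0.47

    # Hypothetical system that claims 100% success while the
    # evaluator measures 28.9%.
    print(round(overestimation_gap(100.0, 28.9), 1))  # 71.1
```

A ratio well below 1 is what the article means by extra tokens buying less than proportional gains.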

Key Points

Best AI systems achieved only 26.7–28.9% overall pass rate on T2J-Bench despite 91.1% Spec-stage pass rate
A 4.7x spread in token budget corresponded to only a 2.2x spread in pass rate
All tested systems overestimated their success by 66.6 to 97.8 points compared to the fixed evaluator

Why It Matters

The benchmark exposes AI agents' overconfidence in code conversion and highlights the need for contract-based validation over shallow checks.
