CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction
New benchmark shows AI models guess outputs without understanding code execution
A team of researchers (Jun Gao, Yun Peng, et al.) from multiple institutions introduced CoRE, a fine-grained code reasoning benchmark that goes beyond traditional output prediction. Unlike existing benchmarks that evaluate LLMs solely on final output correctness under a single implementation, CoRE tests two critical aspects: implementation invariance (consistency across functionally equivalent code) and process transparency (accurate reasoning about intermediate execution states).
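To make the two criteria concrete, here is a minimal illustrative sketch in the spirit of such a benchmark (the functions and the probe about `total` are hypothetical, not actual CoRE items): two functionally equivalent implementations whose outputs a model should predict consistently, plus a question about an intermediate execution state.

```python
# Illustrative example (not an actual CoRE item): two functionally
# equivalent implementations of the same task, plus a probe about an
# intermediate execution state.

def sum_of_squares_loop(nums):
    """Accumulate squares with an explicit loop."""
    total = 0
    for x in nums:
        total += x * x
    return total

def sum_of_squares_builtin(nums):
    """Same computation, expressed with sum() and a generator."""
    return sum(x * x for x in nums)

nums = [3, 1, 4]

# Implementation invariance: a model should predict the same output (26)
# for both versions, since they are functionally equivalent.
assert sum_of_squares_loop(nums) == sum_of_squares_builtin(nums) == 26

# Process transparency: beyond the final output, a model should also
# reason correctly about intermediate states, e.g. that `total` equals
# 10 after the loop has processed the first two elements (9 + 1).
```

A model that answers the final-output question correctly but misstates the intermediate value of `total` would pass an output-only benchmark while failing the process-transparency test.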
Evaluating eight frontier LLMs, the researchers uncovered two fundamental limitations. First, models exhibit a significant robustness gap, with performance varying dramatically across functionally equivalent implementations of the same code. Second, models often display superficial execution, arriving at correct final outputs without reasoning correctly about intermediate execution states. These findings show that output-only evaluation is insufficient for assessing genuine code reasoning, positioning CoRE as a necessary benchmark for robust and faithful code reasoning.
- CoRE tests implementation invariance and process transparency across functionally equivalent code
- Eight frontier LLMs showed a robustness gap, with performance varying across equivalent implementations
- Models exhibited superficial execution, getting correct outputs without reasoning about intermediate states
Why It Matters
The findings suggest that current LLM code reasoning often amounts to guesswork over final outputs, underscoring the need for benchmarks that probe the reasoning process itself.