CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction
New benchmark shows AI models guess outputs without understanding code execution
A team of researchers (Jun Gao, Yun Peng, et al.) from multiple institutions introduced CoRE, a fine-grained code reasoning benchmark that goes beyond traditional output prediction. Unlike existing benchmarks that evaluate LLMs solely on final output correctness under a single implementation, CoRE tests two critical aspects: implementation invariance (consistency across functionally equivalent code) and process transparency (accurate reasoning about intermediate execution states).
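To make the two criteria concrete, here is a minimal illustrative sketch in the spirit of such a benchmark (the functions and the probe about `total` are hypothetical, not actual CoRE items): two functionally equivalent implementations whose outputs a model should predict consistently, plus a question about an intermediate execution state.

```python
# Illustrative example (not an actual CoRE item): two functionally
# equivalent implementations of the same task, plus a probe about an
# intermediate execution state.

def sum_of_squares_loop(nums):
    """Accumulate squares with an explicit loop."""
    total = 0
    for x in nums:
        total += x * x
    return total

def sum_of_squares_builtin(nums):
    """Same computation, expressed with sum() and a generator."""
    return sum(x * x for x in nums)

nums = [3, 1, 4]

# Implementation invariance: a model should predict the same output (26)
# for both versions, since they are functionally equivalent.
assert sum_of_squares_loop(nums) == sum_of_squares_builtin(nums) == 26

# Process transparency: beyond the final output, a model should also
# reason correctly about intermediate states, e.g. that `total` equals
# 10 after the loop has processed the first two elements (9 + 1).
```

A model that answers the final-output question correctly but misstates the intermediate value of `total` would pass an output-only benchmark while failing the process-transparency test.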
Evaluating eight frontier LLMs, the researchers uncovered two fundamental limitations. First, models exhibit a significant robustness gap, with performance varying dramatically across functionally equivalent implementations of the same code. Second, models often display superficial execution, arriving at correct final outputs without reasoning correctly about intermediate execution states. These findings show that output-only evaluation is insufficient for assessing genuine code reasoning, positioning CoRE as a necessary benchmark for robust and faithful code reasoning.
- CoRE tests implementation invariance and process transparency across functionally equivalent code
- Eight frontier LLMs showed a robustness gap, with performance varying across equivalent implementations
- Models exhibited superficial execution, getting correct outputs without reasoning about intermediate states
Why It Matters
The findings suggest that current LLM code reasoning often amounts to guesswork over final outputs, underscoring the need for benchmarks that probe the reasoning process itself.