The Path Not Taken: Duality in Reasoning about Program Execution
Researchers propose a benchmark that evaluates LLMs on two complementary reasoning tasks: predicting a program's behavior on a given input, and inferring how the input must change to produce a target behavior.
A new research paper accepted to ACL 2026 introduces DexBench, a benchmark designed to evaluate how well large language models (LLMs) understand program execution through a pair of dual reasoning tasks. The authors argue that current benchmarks, which focus on predicting program properties tied to specific inputs (e.g., code coverage, outputs), offer a narrow view of dynamic code reasoning and are vulnerable to data contamination. DexBench instead probes LLMs' causal understanding of execution flow by requiring them both to predict a program's observed behavior for a given input and to infer how the input must be mutated to achieve a specific behavioral objective.
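To make the two task directions concrete, here is a minimal Python sketch of what a paired instance might look like. The `classify` program, the inputs, and the mutation objective are hypothetical illustrations chosen for this article, not examples taken from DexBench itself.

```python
# A hypothetical paired instance illustrating the dual tasks.
# Neither the program nor the questions come from the benchmark.

def classify(n: int) -> str:
    """Toy program under test."""
    if n % 2 == 0:
        return "even"
    if n > 100:
        return "large-odd"
    return "small-odd"

# Forward task: given an input, predict the observed behavior.
# The model is asked: what does classify(7) return?
assert classify(7) == "small-odd"

# Inverse task: given a behavioral objective, infer how the input
# must be mutated. The model is asked: change the input 7 so the
# program returns "large-odd" instead.
mutated_input = 101  # one valid answer: an odd value greater than 100
assert classify(mutated_input) == "large-odd"
```

The forward direction resembles familiar output-prediction benchmarks, while the inverse direction forces the model to reason backward over branch conditions, which is harder to shortcut with memorized input-output pairs.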
The benchmark comprises 445 paired instances, on which the team evaluated 13 LLMs, finding that dual-path reasoning provides a more robust and discriminative proxy for dynamic code understanding than single-direction prediction. Because a model must reason in both directions over the same program, it is harder to score well through surface-level pattern matching alone, addressing a gap in current evaluation methods. The findings suggest that incorporating such dual reasoning tasks into future LLM development could improve reliability in software engineering and coding applications.
- DexBench includes 445 paired instances for dual reasoning tasks.
- Evaluated 13 LLMs on predicting behavior and inferring input mutations.
- Dual-path reasoning offers a more robust measure of dynamic code understanding.
Why It Matters
By testing causal understanding of execution rather than memorized patterns, benchmarks like DexBench could guide development toward LLMs that are more reliable for code generation and debugging.