Developer Tools

How Robustly do LLMs Understand Execution Semantics?

A new study reveals that frontier models like GPT-5.2 struggle to understand code robustly despite near-perfect benchmark scores.

Deep Dive

A new study from researchers Claudio Spiess, Prem Devanbu, and Earl T. Barr reveals a significant weakness in how large language models understand code. The paper, titled 'How Robustly do LLMs Understand Execution Semantics?', tested models on a program-output prediction task using the CRUXEval benchmark. The findings show a stark divergence: OpenAI's frontier model GPT-5.2 achieved a near-perfect 99% accuracy on original, unperturbed code but saw its performance plummet by 20-24% when faced with semantically equivalent but syntactically perturbed inputs. This suggests the model may rely heavily on sophisticated pattern matching rather than developing a robust internal world model of code execution.
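To make the task concrete, here is a minimal, hypothetical sketch of what a CRUXEval-style output-prediction pair looks like (this example is illustrative, not taken from the benchmark): the model is asked to predict a function's output, then asked again on a semantics-preserving rewrite of the same function.

```python
# Hypothetical CRUXEval-style task: predict the output of f("abc").
def f(s):
    result = []
    for ch in s:
        result.append(ch.upper())
    return "".join(result)

# Semantically equivalent perturbation: identifiers renamed and the
# for-loop rewritten as a while-loop. The behavior is unchanged, so a
# model with a robust execution model should predict the output
# equally well on either version.
def f_perturbed(x0):
    acc = []
    i = 0
    while i < len(x0):
        acc.append(x0[i].upper())
        i += 1
    return "".join(acc)

# Both versions compute the same result.
assert f("abc") == f_perturbed("abc") == "ABC"
```

A pattern-matching model might answer correctly on `f` (which resembles common training data) yet fail on `f_perturbed`, even though both run identically.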

In contrast, open-source reasoning models like the DeepSeek-R1 family demonstrated more stable, albeit lower, performance—maintaining accuracies between 38% and 67% across both original and perturbed inputs. The research also uncovered a specific vulnerability: most models performed significantly worse at predicting the behavior of code that raises exceptions, and their accuracy varied with the type of exception raised. The authors evaluated potential remedies for this deficiency and made the case for code perturbation as a critical tool for probing whether code models truly understand execution semantics, rather than relying on static benchmark scores alone.
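The exception-prediction weakness is easy to illustrate with a hypothetical example (again, not drawn from the paper): here the correct answer is not a return value but the specific exception the code raises, which requires tracing execution rather than recognizing a familiar pattern.

```python
# Hypothetical exception-prediction case: the model must predict which
# exception g([1, 2, 3]) raises, not a return value.
def g(xs):
    total = 0
    # Off-by-one bug: the loop runs one step past the last valid index.
    for i in range(len(xs) + 1):
        total += xs[i]
    return total

# Answering correctly requires simulating the loop far enough to see
# that it reads xs[3], which raises IndexError (not, say, TypeError).
try:
    g([1, 2, 3])
except IndexError as exc:
    print("IndexError:", exc)
```

Distinguishing which exception type fires, and at which step, is exactly the kind of fine-grained execution reasoning the study found most models to be weakest at.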

Key Points
  • GPT-5.2's accuracy dropped 20-24% on perturbed code inputs despite a 99% score on the original CRUXEval benchmark.
  • Open-source models like DeepSeek-R1 showed more stable performance (38-67% accuracy) across both original and transformed code.
  • All models struggled significantly with predicting code behavior that results in exceptions, revealing a common weakness in semantic understanding.

Why It Matters

This exposes a critical gap between benchmark performance and real-world coding reliability, impacting developers who trust AI for code generation and review.