ExecVerify: White-Box RL with Verifiable Stepwise Rewards for Code Execution Reasoning
A new RL method uses white-box execution traces to verify every step of code reasoning, not just final output.
A research team led by Lingxiao Tang has published a paper on ExecVerify, a training framework that targets a core weakness in how code LLMs learn to reason about execution: their intermediate reasoning steps are never verified during training. Current methods rely on supervised fine-tuning (SFT) on teacher-generated explanations, which often reduces to text imitation because intermediate execution states are not explicitly checked for correctness. ExecVerify instead applies reinforcement learning (RL) with "white-box" rewards derived directly from program execution traces: the model is asked, for example, to predict the next statement to execute or a variable's value or type at a given step, and each answer is checked against the actual trace.
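To make the idea of a trace-derived reward concrete, here is a minimal sketch in Python. It is not the paper's implementation: the helper names (`trace_execution`, `check_next_statement`, `check_variable`), the toy program, and the 0/1 reward values are all assumptions chosen for illustration. It records which line runs at each step and what the local variables hold, then scores a predicted next statement or variable value against that record.

```python
import sys

# Toy program to trace (hypothetical example, not taken from the paper).
PROGRAM = """def f(x):
    y = x * 2
    if y > 5:
        y = y - 1
    return y
"""

_MISSING = object()


def trace_execution(source, func_name, *args):
    """Run func_name from source; record (line_number, locals) before each line runs."""
    namespace = {}
    exec(compile(source, "<program>", "exec"), namespace)
    steps = []

    def tracer(frame, event, arg):
        # Only follow frames that belong to the traced program.
        if frame.f_code.co_filename != "<program>":
            return None
        if event == "line":
            # Shallow snapshot of locals *before* this line executes.
            steps.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = namespace[func_name](*args)
    finally:
        sys.settrace(previous)
    return steps, result


def check_next_statement(steps, i, predicted_line):
    """Reward 1.0 if the line executed right after step i matches the prediction."""
    if i + 1 >= len(steps):
        return 0.0
    return 1.0 if steps[i + 1][0] == predicted_line else 0.0


def check_variable(steps, i, name, predicted):
    """Reward 1.0 if `name` has the predicted value and type once step i has run."""
    if i + 1 >= len(steps):
        return 0.0
    # The snapshot taken before step i+1 is the state after step i.
    actual = steps[i + 1][1].get(name, _MISSING)
    same = actual is not _MISSING and type(actual) is type(predicted) and actual == predicted
    return 1.0 if same else 0.0


steps, result = trace_execution(PROGRAM, "f", 3)
print([line for line, _ in steps], result)
print(check_next_statement(steps, 1, 4))  # after `if y > 5:` with y == 6
print(check_variable(steps, 0, "y", 6))   # after `y = x * 2`
```

Because every check is a comparison against what the interpreter actually did, the reward cannot be satisfied by a plausible-sounding but wrong explanation, which is the contrast with text imitation that the article draws.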
The system first builds a multi-difficulty dataset using constraint-based program synthesis, which keeps training complexity under control. It then trains in two stages: the first strengthens execution reasoning with these stepwise rewards, and the second transfers that ability to code generation. In the reported experiments, a 7-billion-parameter model trained with ExecVerify performs comparably to much larger 32B models on code reasoning benchmarks. It also improves pass@1 on code generation by up to 5.9% over strong post-training baselines, suggesting that rewarding correctness at each execution step carries over to writing correct code.
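For readers unfamiliar with the metric, pass@1 is the probability that a single sampled solution passes all of a problem's tests. A common way to compute it is the unbiased pass@k estimator introduced with the HumanEval benchmark; whether ExecVerify's evaluation uses exactly this estimator is an assumption here, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c passed all tests."""
    if n - c < k:
        # Every size-k subset of the samples must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two hypothetical problems, 10 samples each, with 3 and 10 passing samples.
print(mean_pass_at_k([(10, 3), (10, 10)], k=1))
```

For k = 1 the estimator reduces to c / n, the fraction of samples that pass.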
- Uses white-box RL with rewards from execution traces (next-statement, variable value/type prediction) instead of text imitation.
- Trains on a synthetically generated dataset with controlled difficulty levels via constraint-based program synthesis.
- Enables a 7B-parameter model to perform comparably to 32B models on code reasoning and boosts code generation pass@1 by up to 5.9%.
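The article does not describe how the constraint-based synthesis works, so the following is only a toy sketch of the general idea: generate programs whose structure is bounded by per-level constraints, then verify the constraints on the result. The difficulty levels, the constraint names (`num_statements`, `max_branches`), and the program template are all hypothetical.

```python
import ast
import random

# Hypothetical difficulty levels; the real constraints are not given in the article.
LEVELS = {
    "easy": {"num_statements": 2, "max_branches": 0},
    "medium": {"num_statements": 4, "max_branches": 1},
    "hard": {"num_statements": 6, "max_branches": 3},
}


def synthesize(level, rng):
    """Generate a small integer function that respects the level's constraints."""
    spec = LEVELS[level]
    n = spec["num_statements"]
    # Choose which statements become if/else branches, within the allowed budget.
    branch_count = rng.randint(0, spec["max_branches"])
    branch_at = set(rng.sample(range(1, n + 1), branch_count))
    lines = ["def f(x):", "    v0 = x"]
    for i in range(1, n + 1):
        prev, op, const = f"v{i - 1}", rng.choice("+-*"), rng.randint(1, 9)
        if i in branch_at:
            lines += [
                f"    if {prev} % 2 == 0:",
                f"        v{i} = {prev} {op} {const}",
                "    else:",
                f"        v{i} = {prev} + {const}",
            ]
        else:
            lines.append(f"    v{i} = {prev} {op} {const}")
    lines.append(f"    return v{n}")
    return "\n".join(lines) + "\n"


def count_branches(source):
    """Count if-statements in the generated source via the AST."""
    return sum(isinstance(node, ast.If) for node in ast.walk(ast.parse(source)))


rng = random.Random(0)
for level in LEVELS:
    program = synthesize(level, rng)
    assert count_branches(program) <= LEVELS[level]["max_branches"]
    print(f"# {level}\n{program}")
```

Controlling structure at generation time, rather than filtering an existing corpus, is what would let a dataset cover each difficulty level evenly.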
Why It Matters
Lets smaller, cheaper code models match much larger ones on execution reasoning, which could reduce compute costs and improve reliability for developers.