Research & Papers

Outcome Rewards Do Not Guarantee Verifiable or Causally Important Reasoning

Outcome rewards may not lead to verifiable reasoning chains.

Deep Dive

A new paper from Qinan Yu, Alexa Tartaglini, and colleagues at Stanford and other institutions challenges a core assumption in AI training: that Reinforcement Learning from Verifiable Rewards (RLVR) on chain-of-thought reasoning leads models to genuinely reason. The team introduces two metrics: Causal Importance of Reasoning (CIR), which measures how much the reasoning tokens actually affect the final answer, and Sufficiency of Reasoning (SR), which checks whether a verifier can deduce the answer from the reasoning alone. Testing the Qwen2.5 model series on ReasoningGym tasks, they found that while RLVR improves task accuracy, it does not reliably improve CIR or SR, suggesting that models may rely on shortcuts rather than genuine reasoning.
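
To make the two metrics concrete, here is a minimal Python sketch of one way such scores could be estimated. The sentence-dropping ablation and the answer_fn / verifier_fn callables are illustrative assumptions, not the paper's actual implementation.

```python
import random
from typing import Callable

def causal_importance(
    answer_fn: Callable[[str, str], str],  # hypothetical: (question, reasoning) -> final answer
    question: str,
    reasoning: str,
    original_answer: str,
    n_ablations: int = 8,
) -> float:
    """Estimate how causally important the reasoning is to the final answer.

    The reasoning is ablated by randomly dropping sentences, and we count how
    often the re-decoded answer changes; a higher score means the answer
    depends more heavily on the reasoning.
    """
    sentences = [s for s in reasoning.split(".") if s.strip()]
    flips = 0
    for _ in range(n_ablations):
        kept = [s for s in sentences if random.random() > 0.5]  # crude sentence-level ablation
        new_answer = answer_fn(question, ". ".join(kept))
        flips += int(new_answer.strip() != original_answer.strip())
    return flips / n_ablations

def sufficiency(
    verifier_fn: Callable[[str], str],  # hypothetical: reasoning text -> verifier's answer guess
    reasoning: str,
    original_answer: str,
) -> float:
    """Score 1.0 if a verifier recovers the answer from the reasoning alone, else 0.0."""
    return float(verifier_fn(reasoning).strip() == original_answer.strip())
```

A model that answers correctly even when its reasoning is ablated away would score low on the first metric; reasoning too vague for a verifier to reconstruct the answer would score low on the second.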

The study also proposes remedies. A small amount of supervised fine-tuning (SFT) before RLVR can boost CIR and SR, as can adding auxiliary CIR/SR rewards on top of the outcome-based reward. This joint-reward approach matches the accuracy of standard RLVR while producing reasoning that is causally important and sufficient. The findings suggest that common post-training methods may not produce the reasoning we assume they do, but that simple modifications can close the gap, making AI reasoning more transparent and trustworthy.
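
The paper's exact reward formulation is not reproduced here, but a minimal sketch of the joint-reward idea, assuming the auxiliary terms are simply weighted and added to the binary outcome reward, could look like this (the weights are illustrative, not values from the paper):

```python
def joint_reward(
    outcome_correct: bool,
    cir_score: float,    # causal-importance estimate in [0, 1]
    sr_score: float,     # sufficiency estimate in [0, 1]
    w_cir: float = 0.5,  # illustrative weights, not values from the paper
    w_sr: float = 0.5,
) -> float:
    """Combine the verifiable outcome reward with auxiliary CIR/SR terms.

    The outcome term keeps task accuracy as the primary training signal; the
    auxiliary terms additionally reward reasoning that is causally load-bearing
    and sufficient for a verifier to reconstruct the answer.
    """
    return float(outcome_correct) + w_cir * cir_score + w_sr * sr_score

# A correct answer with weak reasoning scores lower than one whose reasoning
# is also causally important and sufficient.
print(joint_reward(True, cir_score=0.1, sr_score=0.0))  # 1.05
print(joint_reward(True, cir_score=0.9, sr_score=1.0))  # 1.95
```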

Key Points
  • RLVR improves task accuracy but not causal importance (CIR) or sufficiency (SR) of reasoning.
  • Adding a small amount of SFT before RLVR remedies low CIR and SR.
  • Auxiliary CIR/SR rewards with outcome rewards maintain accuracy while improving reasoning quality.

Why It Matters

This work reveals that current RLVR methods may not foster genuine reasoning, and it points to simple training changes, such as light SFT before RLVR or auxiliary CIR/SR rewards, that can.