New study: 61-93% of LLM reasoning steps are redundant
Massive over-thinking exposed in four frontier reasoning models.
A new paper from researchers Zhiyuan Zhai, Xinkai You, Wenjing Yan, and Xin Wang, titled "How Much Thinking is Enough? Quantifying and Understanding Redundancy in LLM Reasoning," measures for the first time how much of a reasoning model's chain-of-thought is actually necessary. The authors formalize redundancy as the largest fraction of trailing steps that can be truncated while the model still outputs the correct answer. Testing four frontier reasoning models on MATH-500 and another benchmark, they found that step-level redundancy ranges from 61% to 93% across the eight model-benchmark conditions. In six of those eight conditions, the median critical prefix (the shortest opening stretch of reasoning that still yields the correct answer) is just a single segmented step, meaning most of the thinking that follows is not needed to reach the answer.
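The redundancy definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the `is_correct` callback (standing in for "force the model to answer from this prefix and grade it"), and the choice to search prefixes starting at one step are all assumptions made for the example.

```python
from typing import Callable, Optional, Sequence


def critical_prefix_length(
    steps: Sequence[str],
    is_correct: Callable[[Sequence[str]], bool],
) -> Optional[int]:
    """Return the smallest k such that the first k steps yield a correct answer.

    `is_correct` is a hypothetical callback: given a prefix of reasoning steps,
    it reports whether the model's answer from that prefix is correct.
    Returns None if no prefix (including the full chain) is correct.
    """
    for k in range(1, len(steps) + 1):
        if is_correct(steps[:k]):
            return k
    return None


def step_redundancy(
    steps: Sequence[str],
    is_correct: Callable[[Sequence[str]], bool],
) -> Optional[float]:
    """Fraction of trailing steps that can be cut while the answer stays correct."""
    k = critical_prefix_length(steps, is_correct)
    if k is None:
        return None  # redundancy is undefined when the answer is never correct
    return (len(steps) - k) / len(steps)


# Toy chain of five steps where the answer is already correct after two.
toy_steps = ["s1", "s2", "s3", "s4", "s5"]
toy_redundancy = step_redundancy(toy_steps, lambda prefix: len(prefix) >= 2)
```

In the toy chain, three of the five steps can be truncated, so the redundancy is 0.6. A reported median critical prefix of one step corresponds to `critical_prefix_length` returning 1 for the typical problem.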
The paper goes further by proving that this over-thinking is not a quirk of any particular model. It is a structural consequence of training with length-agnostic outcome rewards: under any such reward, no finite expected stopping time is optimal. In other words, regardless of RL algorithm, base model, or data distribution, reasoning models trained this way will tend to deliberate far longer than they need to. The authors suggest that future work should focus on training methods that reward efficiency or incorporate explicit reasoning budgets, rather than patching individual models.
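A toy numerical illustration may help build intuition for the structural argument. It is not the paper's proof. It assumes a made-up accuracy curve that keeps improving slightly with every extra step, and the names `accuracy`, `best_length`, and `step_cost` are hypothetical. Under that assumption, a reward that ignores length is always maximized at the longest length allowed, while even a small per-step cost produces a finite optimum.

```python
import math


def accuracy(t: int, a: float = 0.5, b: float = 0.3) -> float:
    """Assumed (not measured) probability of a correct answer after t steps.

    Strictly increasing in t, with diminishing returns.
    """
    return 1.0 - a * math.exp(-b * t)


def best_length(horizon: int, step_cost: float = 0.0) -> int:
    """Length in 1..horizon that maximizes expected reward minus length cost.

    step_cost = 0.0 models a length-agnostic outcome reward.
    """
    return max(range(1, horizon + 1), key=lambda t: accuracy(t) - step_cost * t)


# Length-agnostic reward: the optimum moves out as far as the horizon allows.
agnostic_optima = [best_length(h) for h in (10, 20, 40)]

# Small per-step cost: the optimum stays put however large the horizon is.
penalized_optima = [best_length(h, step_cost=0.01) for h in (40, 400)]
```

In this sketch the length-agnostic optimum always equals the horizon, so raising the cap only lengthens the preferred chain. The penalized optimum is the same for both horizons, which is the behavior an efficiency-aware reward or explicit reasoning budget aims for.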
- Step-level redundancy between 61% and 93% across four frontier models on math benchmarks.
- Median critical prefix is a single segmented step in six of eight model-benchmark conditions.
- Over-thinking proven to be structural, caused by length-agnostic outcome rewards, not model bugs.
Why It Matters
This helps explain why reasoning models are slow and costly to run, and it points to changes in how they are trained, rather than per-model fixes, as the path to efficiency.