Google found that longer chains of thought correlate negatively with accuracy (-0.54)
A new metric (DTR) and sampling strategy (Think@n) cut reasoning compute by roughly 50% while improving accuracy.
Google researchers have published a paper that challenges a core assumption in AI reasoning. Analyzing eight leading models, including GPT-OSS, DeepSeek-R1, and Qwen3, across benchmarks like AIME 2024/2025 and GPQA-Diamond, they found an average correlation of -0.54 between reasoning-chain length and answer accuracy. In other words, longer, more verbose outputs often signal that the model is 'spiraling' or overthinking rather than productively reasoning toward a correct solution. To address this, the team proposed a new metric, the Deep Thinking Ratio (DTR), which distinguishes substantive reasoning from linguistic filler by tracking how early each token's prediction stabilizes across the model's layers.
The DTR metric, which shows a strong 0.82 positive correlation with accuracy, enables a practical new inference strategy called Think@n. The method samples multiple reasoning paths, estimates each path's DTR from just its first 50 tokens, discards the half of the samples with the lowest DTR, and takes a majority vote over the remainder. The efficiency gain is dramatic: in tests, token consumption dropped from 355.6k to 181.9k, roughly a 50% compute reduction, while accuracy improved. For instance, GPT-OSS-120B-medium scored 94.7% on AIME 2025 with Think@n versus 92.7% with standard sampling. The result has immediate implications for both local and cloud-based inference, since systems can terminate unproductive reasoning early and allocate compute more effectively.
- Found -0.54 correlation between reasoning length and accuracy across 8 models (GPT-OSS, DeepSeek-R1, Qwen3).
- Introduced DTR metric with 0.82 accuracy correlation; Think@n strategy cuts compute by ~50% (355.6k to 181.9k tokens).
- GPT-OSS-120B-medium accuracy rose to 94.7% on AIME 2025 using Think@n versus 92.7% with standard sampling.
Why It Matters
Enables 50% compute savings for complex AI reasoning, making advanced models more efficient and accessible for local and cloud deployment.