Research & Papers

Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens

Forget counting tokens. This new metric reveals when an AI is truly reasoning.

Deep Dive

A new research paper introduces 'deep-thinking tokens', a novel metric that measures LLM reasoning effort by tracking significant internal revisions in a model's deeper layers. The metric shows a robust positive correlation with accuracy across four challenging benchmarks (including AIME, HMMT, and GPQA) and across models such as GPT-OSS and DeepSeek-R1, substantially outperforming length-based and confidence-based methods. This insight enables 'Think@n', a new scaling strategy that cuts inference costs by rejecting unpromising generations early.
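To make the idea concrete, here is a minimal sketch of a Think@n-style loop. Everything below is illustrative: the `deep` flag per token, the `deep_thinking_score` proxy, and the threshold values are hypothetical stand-ins for the paper's actual deep-layer revision metric, not its implementation.

```python
def deep_thinking_score(tokens):
    """Hypothetical proxy: count tokens flagged as 'deep-thinking'.
    (In the paper this is derived from internal revisions in deeper
    layers; here we simulate it with a boolean flag per token.)"""
    return sum(1 for t in tokens if t["deep"])

def think_at_n(generate, n=8, check_after=32, min_score=4, tail=192):
    """Best-of-n sampling with early rejection (Think@n sketch):
    start n generations, then abort any whose deep-thinking score
    falls below `min_score` after the first `check_after` tokens,
    saving the cost of decoding the rest of that generation."""
    survivors = []
    for _ in range(n):
        prefix = generate(check_after)               # partial generation
        if deep_thinking_score(prefix) < min_score:  # unpromising: reject early
            continue
        survivors.append(prefix + generate(tail))    # finish only the survivors
    return survivors
```

The point of the sketch is the control flow: compute effort cheaply on a short prefix, and spend the full decoding budget only on generations that show genuine reasoning effort.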

Why It Matters

This could slash AI inference costs and improve accuracy by identifying when models are genuinely reasoning, not just 'overthinking'.