LRMs' Human-Like Reasoning Is Fixed, Not Adjustable by Budget
New research reveals giving AI more 'thinking time' doesn't make it think more like us.
A new study posted to arXiv (2026) investigates whether the alignment between Large Reasoning Models (LRMs) and human cognitive costs can be tuned by adjusting inference-time reasoning budgets. Across two models (GPT-OSS-20B and GPT-OSS-120B), three effort levels, and six reasoning tasks, the researchers measured how well chain-of-thought token lengths matched human reaction times. Within-task and cross-task alignment remained numerically near-identical across all conditions, with Bayes factors favoring the null hypothesis of no difference. A manipulation check revealed that the effort parameter sets an upper ceiling on generation rather than acting as a real-time allocation dial. In other words, the model's policy for distributing reasoning tokens is fixed during training and cannot be adjusted on the fly. Scaling model size, however, did improve the granularity of alignment on arithmetic complexity tasks, suggesting that larger models learn finer-grained difficulty patterns.
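To make the measurement concrete, here is a minimal sketch of how alignment between chain-of-thought length and human reaction time could be scored per effort level. It assumes a Spearman rank correlation as the alignment metric, which this summary does not confirm as the paper's choice, and every number below is invented for illustration; none of it is data from the study.

```python
from statistics import mean

def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical mean human reaction times (ms) for five items of rising difficulty.
human_rt_ms = [620, 810, 1150, 1400, 2100]

# Hypothetical chain-of-thought token counts for the same items at each effort level.
cot_tokens = {
    "low":    [40, 90, 55, 120, 200],
    "medium": [60, 140, 85, 190, 330],
    "high":   [95, 230, 130, 310, 540],
}

alignment = {
    effort: spearman(tokens, human_rt_ms)
    for effort, tokens in cot_tokens.items()
}

for effort, rho in alignment.items():
    print(f"{effort:>6}: rho = {rho:.3f}")
```

In this toy data the absolute token counts grow with effort, but the ordering of items by length never changes, so the rank-based alignment score is the same at every level. That is one way an effort setting can change how much a model writes without changing how it distributes that writing across easy and hard items.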
These findings have significant implications for how we think about interpretability and control in AI systems. The paper supports a "compiled" (training-time) account of LRM problem-solving rather than an "online" (inference-time) one. For practitioners, this means that adjusting reasoning budgets (e.g., increasing max tokens in chain-of-thought) won't make models emulate human cognitive patterns more closely. At most it yields longer output, not a different allocation of effort across easy and hard problems. The alignment observed between LRM token allocation and human difficulty is a robust property instilled by training. This challenges the intuitive approach of improving alignment simply by giving models more compute at inference. Instead, any improvements to human-like reasoning must come from changes in training data, architecture, or objectives.
- Alignment between LRM chain-of-thought length and human reaction times was invariant across 3 effort levels and 6 reasoning tasks, with Bayes factors supporting the null.
- The 'effort' parameter acts as a ceiling on generation, not a dial; reasoning allocation policy is fixed during training.
- Model scale (120B vs 20B) improved the match between token allocation and fine-grained human difficulty patterns on arithmetic tasks.
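The ceiling-versus-dial distinction in the second point can be illustrated with a toy model. The sketch below is not the paper's manipulation check: `intended_tokens`, `ceiling_policy`, and `dial_policy` are hypothetical names, and the idea that a model has a single trained "intended" length per problem is a simplifying assumption made only for illustration.

```python
def ceiling_policy(intended_tokens, cap):
    """Effort as a ceiling: each output is truncated at the cap, nothing more."""
    return [min(t, cap) for t in intended_tokens]

def dial_policy(intended_tokens, scale):
    """Effort as a dial (hypothetical contrast): every output is rescaled."""
    return [int(t * scale) for t in intended_tokens]

# Hypothetical trained-in reasoning lengths for an easy, medium, and hard problem.
intended_tokens = [50, 120, 300]

# Under a ceiling, raising the cap past the longest intended length changes nothing.
print(ceiling_policy(intended_tokens, cap=1000))  # [50, 120, 300]
print(ceiling_policy(intended_tokens, cap=4000))  # [50, 120, 300]

# A cap below the intended lengths only clips the longer outputs.
print(ceiling_policy(intended_tokens, cap=100))   # [50, 100, 100]

# Under a dial, every problem's length would respond to the setting.
print(dial_policy(intended_tokens, scale=2.0))    # [100, 240, 600]
```

Under these assumptions, a ceiling can only remove tokens the model would otherwise have produced, so it cannot reshape how reasoning is distributed across problems.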
Why It Matters
Reasoning budget adjustments won't make models think more like humans; alignment gains must come from training, not inference tweaks.