DeepSeek-R1 study shows longer reasoning amplifies position bias in answers
More thinking doesn't mean fairer answers: it can make a model more biased toward an option's position.
A new study titled "More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models" challenges the assumption that chain-of-thought (CoT) reasoning reduces heuristic biases. The author, Xiao Wang, tested reasoning-tuned models including DeepSeek-R1 (671B), R1-distilled 7–8B variants (R1-Qwen, R1-Llama), and base models with CoT prompting on the MMLU, ARC-Challenge, and GPQA benchmarks. The key finding: within a given reasoning-capable model, position bias (preferring an answer option because of where it appears in the list rather than what it says) grows with the length of the reasoning trajectory. Across 13 reasoning-mode configurations, 12 showed a positive partial correlation between trajectory length and Position Bias Score (PBS), ranging from 0.11 to 0.41 (all p < 0.05). All 12 open-weight configurations showed monotonically increasing PBS across length quartiles.
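The two statistics above (a partial correlation between length and bias, and bias per length quartile) can be sketched as follows. This is a minimal illustration, not the paper's code: the summary does not give the PBS formula or say which variables the partial correlation controls for, so the per-item `flips` indicator (1 if the chosen answer changed when the options were reordered), the single `correct` covariate, and all function names are assumptions made here.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Plain Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after removing the linear effect of zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


def bias_by_length_quartile(lengths, flips):
    """Mean flip rate in each trajectory-length quartile, shortest first.

    The flip rate is a stand-in for PBS (assumption, see lead-in).
    """
    ordered = [f for _, f in sorted(zip(lengths, flips), key=lambda p: p[0])]
    n = len(ordered)
    if n < 4:
        raise ValueError("need at least four items to form quartiles")
    bounds = [i * n // 4 for i in range(5)]
    return [mean(ordered[bounds[i]:bounds[i + 1]]) for i in range(4)]


# Toy data, invented for illustration only.
lengths = [100, 200, 300, 400, 500, 600, 700, 800]  # reasoning tokens
flips = [0, 0, 0, 1, 0, 1, 1, 1]                    # answer changed on reorder
correct = [1, 1, 0, 1, 0, 1, 0, 0]                  # hypothetical covariate

print(bias_by_length_quartile(lengths, flips))
print(partial_corr(lengths, flips, correct))
```

A positive partial correlation together with a rising quartile profile is the pattern the study reports; on real data the covariates and the bias measure should follow the paper's own definitions.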
A truncation intervention provided causal evidence: when the model's reasoning was truncated and resumed from later points in the trajectory, its answers shifted increasingly toward position-preferred options, rising from 16% to 32% for R1-Qwen-7B across absolute-position buckets. At the 671B scale, DeepSeek-R1's aggregate PBS collapsed to 0.019, but the length effect still appeared in the longest quartile (PBS = 0.071), suggesting that high accuracy masks but does not eliminate the underlying mechanism. The study also found that direct-answer position bias (without CoT) is a distinct phenomenon with a different footprint. It argues that reasoning models should not be treated as order-robust by default in MCQ evaluation pipelines, and it provides a diagnostic toolkit (PBS, commitment change point, effective switching, truncation probes) for auditing position bias.
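The bucketed shift rates from a truncation probe could be tabulated roughly as below. This is a sketch under stated assumptions: the record format (token position of the truncation point, plus whether the resumed answer moved to a position-preferred option), the function name, and the 512-token bucket width are all hypothetical, since the summary does not specify how the buckets were defined.

```python
from collections import defaultdict


def shift_rate_by_bucket(probes, bucket_width=512):
    """Share of probes that shifted to a position-preferred option,
    grouped by the absolute token position of the truncation point.

    probes: iterable of (token_position, shifted) pairs.
    Returns {bucket_start_token: shift_rate}, ordered by bucket start.
    """
    totals = defaultdict(int)
    shifts = defaultdict(int)
    for position, shifted in probes:
        bucket = position // bucket_width
        totals[bucket] += 1
        shifts[bucket] += int(shifted)
    return {b * bucket_width: shifts[b] / totals[b] for b in sorted(totals)}


# Toy probes, invented for illustration only.
probes = [(10, False), (20, True), (600, True), (700, True)]
print(shift_rate_by_bucket(probes))
```

A shift rate that climbs from the earliest bucket to the latest is the signature the study describes for R1-Qwen-7B.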
- Across 13 reasoning-mode configurations on MMLU, ARC-Challenge, and GPQA, 12 showed a positive partial correlation (0.11–0.41) between reasoning length and position bias.
- In truncation experiments, R1-Qwen-7B's shift toward position-preferred answers rose from 16% to 32% as reasoning was resumed from later points in the trajectory.
- Even DeepSeek-R1 at 671B showed length-driven bias in its longest quartile (PBS = 0.071) despite overall bias collapsing to 0.019.
Why It Matters
The study shows that 'thinking longer' doesn't guarantee order-robust answers, so MCQ evaluation pipelines should audit reasoning models for length-driven position bias.