Research & Papers

New study reveals small LMs cheat at arithmetic by copying the last number in CoT

Qwen and Llama 1-3B models copy the final CoT number instead of reasoning; the shortcut accounts for 89-92% of their teacher-forcing ceiling

Deep Dive

A new paper by Ming Liu (arXiv:2605.22870) exposes a fundamental flaw in how small language models (1-3B parameters) handle chain-of-thought (CoT) arithmetic reasoning. Testing on Qwen, Llama, and Gemma instruction-tuned models using the GSM8K benchmark, Liu found that the models overwhelmingly rely on a 'positional shortcut' during the answer-readout stage: they simply copy whichever number occupies the trailing position before the answer delimiter, regardless of the actual intermediate reasoning. This shortcut accounts for 54-92 percentage points of accuracy, representing 89-92% of each model's teacher-forcing ceiling. Even on incorrect items, the final answer matches the last CoT number 95-96% of the time.
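
The copy statistic described above can be illustrated with a minimal Python sketch. This is not the paper's code: the `####` delimiter (the one used in GSM8K reference solutions), the number regex, and the helper names `last_number` and `copy_rate` are all assumptions made for illustration.

```python
import re

# Assumed number pattern: digits with optional thousands separators and an
# optional decimal part. Signs are ignored to keep the sketch simple.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def copy_rate(samples, delimiter="####"):
    """Fraction of samples whose final answer equals the last number in the CoT.

    Each sample is one string: the chain of thought, the delimiter, the answer.
    Samples without a delimiter, or without a number on either side, are skipped.
    """
    hits = total = 0
    for sample in samples:
        cot, sep, answer = sample.partition(delimiter)
        if not sep:
            continue
        trailing, final = last_number(cot), last_number(answer)
        if trailing is None or final is None:
            continue
        total += 1
        hits += trailing == final
    return hits / total if total else 0.0


samples = [
    "3 + 4 = 7 #### 7",            # answer matches the trailing CoT number
    "2 * 5 = 10, minus 1 #### 9",  # answer differs from the trailing number (1)
]
print(copy_rate(samples))
```

Running the same measurement only on items the model got wrong would correspond to the kind of statistic the article reports for incorrect items, under the assumptions stated above.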

The copy channel takes precedence over genuine computation. When Liu replaced the trailing number with a wrong value, accuracy collapsed to near zero despite correct intermediate steps. Conversely, removing that trailing number recovered 5-32 percentage points above that floor. Even single-step arithmetic that the model could otherwise perform was suppressed when a copyable number was present. Qwen and Llama copied novel distractors 87-95% of the time, while Gemma showed selective gating. Head-level ablation identified architecture-specific head sets responsible for the shortcut. On non-arithmetic BBH tasks, shuffle retention dropped sharply, and at 7-8B scale, content-selective gating began to emerge. The findings have serious implications: CoT-based faithfulness evaluations risk conflating positional answer transport with genuine computation, undermining oversight methods that rely on CoT transparency.
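
The two interventions described above (swapping the trailing number for a wrong value, and removing it) might be built along these lines. This is a hedged sketch, not the paper's procedure: the regex and the helper names `replace_trailing_number` and `remove_trailing_number` are hypothetical, and "removing" is read here as deleting only the last numeric token.

```python
import re

# Assumed number pattern: digits with optional thousands separators and an
# optional decimal part.
NUMBER = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def _last_match(cot):
    """Return the regex match for the last number in cot, or None."""
    matches = list(NUMBER.finditer(cot))
    return matches[-1] if matches else None


def replace_trailing_number(cot, wrong_value):
    """Swap the last number in the CoT for wrong_value; earlier steps are untouched."""
    m = _last_match(cot)
    if m is None:
        return cot
    return cot[:m.start()] + str(wrong_value) + cot[m.end():]


def remove_trailing_number(cot):
    """Delete the last number in the CoT so nothing copyable sits in that slot."""
    m = _last_match(cot)
    if m is None:
        return cot
    return cot[:m.start()].rstrip() + cot[m.end():]


cot = "2 * 5 = 10, so the total is 10."
print(replace_trailing_number(cot, 99))  # intermediate step stays correct
print(remove_trailing_number(cot))
```

The edited chain of thought would then be fed back to the model ahead of the answer delimiter; a model relying on the positional shortcut would be expected to reproduce the injected value rather than the one implied by the intermediate steps.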

Key Points
  • In small LMs (1-3B params), copying the last number in the CoT accounts for 89-92% of each model's teacher-forcing ceiling on GSM8K, rather than reasoning
  • Replacing the trailing number with a wrong value collapses accuracy to near zero, confirming the shortcut's dominance
  • Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively, and 7-8B models show emerging content selection

Why It Matters

Challenges the assumption that CoT reveals genuine reasoning in small LMs, which threatens oversight methods that rely on CoT transparency.