New study reveals small LMs cheat arithmetic by copying last number in CoT

arXiv cs.LG May 25, 2026

⚡ Qwen and Llama 1-3B models copy the final number instead of reasoning, a shortcut that accounts for 89-92% of their teacher-forcing ceiling

Deep Dive

A new paper by Ming Liu (arXiv:2605.22870) exposes a fundamental flaw in how small language models (1-3B parameters) handle chain-of-thought (CoT) arithmetic reasoning. Testing on Qwen, Llama, and Gemma instruction-tuned models using the GSM8K benchmark, Liu found that the models overwhelmingly rely on a 'positional shortcut' during the answer-readout stage: they simply copy whichever number occupies the trailing position before the answer delimiter, regardless of the actual intermediate reasoning. This shortcut accounts for 54-92 percentage points of accuracy, representing 89-92% of each model's teacher-forcing ceiling. Even on incorrect items, the final answer matches the last CoT number 95-96% of the time.
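The match-rate figure above can be illustrated with a minimal sketch. It is not the paper's code: it assumes each generation is a single string of the form "CoT #### answer" (the `####` delimiter is the GSM8K gold-answer convention, and the paper's own delimiter and number parsing may differ), and the names `last_number` and `trailing_match_rate` are hypothetical.

```python
import re

# Assumed number format: optional minus sign, optional thousands separators,
# optional decimal part. The paper's parsing rules may differ.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def last_number(text):
    """Return the last number found in text as a float, or None if there is none."""
    matches = NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def trailing_match_rate(generations, delimiter="####"):
    """Fraction of generations whose final answer equals the last CoT number.

    Generations without the delimiter, or without a number on either side
    of it, are skipped rather than counted as mismatches.
    """
    matched = 0
    total = 0
    for generation in generations:
        cot, sep, answer = generation.rpartition(delimiter)
        if not sep:
            continue
        cot_value = last_number(cot)
        answer_value = last_number(answer)
        if cot_value is None or answer_value is None:
            continue
        total += 1
        if cot_value == answer_value:
            matched += 1
    return matched / total if total else 0.0


if __name__ == "__main__":
    samples = [
        "2 + 3 = 5 #### 5",
        "10 - 4 = 6, so 7 #### 7",
        "1 + 1 = 2 #### 3",
    ]
    print(trailing_match_rate(samples))
```

A high match rate on its own does not separate copying from correct computation, since a sound CoT also ends on the right number. That is why the study pairs it with interventions on the trailing number.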

The copy channel takes precedence over genuine computation. When the trailing number was replaced with a wrong value, accuracy collapsed to near zero despite correct intermediate steps. Conversely, removing that trailing number recovered 5-32 percentage points above that floor: even single-step arithmetic that the model could otherwise perform was suppressed when a copyable number was present. Qwen and Llama copied novel distractors 87-95% of the time, while Gemma showed selective gating. Head-level ablation identified architecture-specific head sets responsible for the shortcut. On non-arithmetic BBH tasks, shuffle retention dropped sharply, and at 7-8B scale, content-selective gating began to emerge. The findings have serious implications: CoT-based faithfulness evaluations risk conflating positional answer transport with genuine computation, which undermines oversight methods that rely on CoT transparency.
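The trailing-number replacement described above could be sketched as a simple text edit on the CoT before it is fed back to the model. This is an illustrative assumption about how such a perturbation might be built, not the paper's procedure; `replace_trailing_number` is a hypothetical helper, and the model call that would follow is left out.

```python
import re

# Same assumed number format as a typical GSM8K-style parser:
# optional minus sign, optional thousands separators, optional decimals.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")


def replace_trailing_number(cot, new_value):
    """Swap the last number in cot for new_value and leave everything else intact.

    Returns cot unchanged when it contains no number.
    """
    matches = list(NUMBER.finditer(cot))
    if not matches:
        return cot
    last = matches[-1]
    return cot[:last.start()] + str(new_value) + cot[last.end():]


if __name__ == "__main__":
    cot = "4 * 3 = 12. The total is 12"
    # The intermediate step still says 12; only the trailing number is wrong.
    print(replace_trailing_number(cot, 99))
```

If a model's answer follows the swapped value (99 here) even though the intermediate step still reads 12, the readout is tracking position rather than the computation.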

Key Points

Small LMs (1-3B parameters) owe 89-92% of their teacher-forcing ceiling on GSM8K to copying the last number in the CoT, not to reasoning
Replacing the trailing number with a wrong value collapses accuracy to near zero, confirming the shortcut's dominance
Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively, and 7-8B models show emerging content selection

Why It Matters

Challenges the assumption that CoT reveals genuine reasoning in small LMs, threatening oversight methods.
