Arithmetic OOD Failure Unfolds in Stages in Minimal GPTs
A new study traces why minimal GPTs trained on exhaustive 2-digit addition still fail at 3-digit sums, breaking the failure into layout, carry, recomposition, and tens-residual stages.
A new research paper by Seine A. Shintani provides a detailed experimental decomposition of why minimal GPT models fail at simple arithmetic generalization. The model was trained on exhaustive 2-digit addition, in which every local digit transition was present, yet it still failed on 3-digit out-of-distribution (OOD) tasks. The failure was not monolithic: it unfolded in four distinct, testable stages, offering a clearer diagnostic path than a single benchmark score.
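The train/eval split described above can be sketched as follows. The plain-text "a+b=c" encoding and the `make_examples` helper are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of the split described above: exhaustive 2-digit
# addition for training, with 3-digit sums held out as the OOD test set.
# The "a+b=c" string encoding is an assumption, not the paper's format.

def make_examples(lo, hi):
    """Enumerate every a+b pair with both operands in [lo, hi] as text."""
    return [f"{a}+{b}={a + b}" for a in range(lo, hi + 1)
                               for b in range(lo, hi + 1)]

train = make_examples(10, 99)    # all 8,100 2-digit pairs (in-distribution)
ood   = make_examples(100, 999)  # 3-digit pairs the model never sees

print(len(train), len(ood))  # 8100 810000
```

Because the 2-digit set is exhaustive, every local digit transition appears during training, which is what makes the 3-digit failure a genuine generalization gap rather than a coverage gap.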
First, a 'layout barrier' emerged: models using absolute position encoding collapsed under a pure 3-digit layout shift, and the only effective fix was exposing the model to mixed-layout data during training. Second, even after layout repair, the model treated the hundreds digit as a simple carry flag rather than as a digit with its own place value; targeted 'carry probes' were needed to correct this.
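A 'carry probe' in this spirit is essentially a linear readout trained to recover a carry bit from hidden activations. The sketch below is a minimal stand-in, not the paper's method: random vectors replace the model's hidden states (so the probe can only hover near the majority-class baseline), and the closed-form least-squares fit is an illustrative choice.

```python
# Minimal sketch of a linear 'carry probe': fit a linear readout that tries
# to recover the units-column carry bit from hidden states. Random vectors
# stand in for real model activations here, so accuracy stays near chance.
import numpy as np

rng = np.random.default_rng(0)
pairs = [(a, b) for a in range(10, 100) for b in range(10, 100)]
carry = np.array([float(a % 10 + b % 10 >= 10) for a, b in pairs])

hidden = rng.normal(size=(len(pairs), 64))          # placeholder activations
X = np.hstack([hidden, np.ones((len(pairs), 1))])   # append a bias column
w, *_ = np.linalg.lstsq(X, carry, rcond=None)       # closed-form linear fit

acc = ((X @ w > 0.5) == carry.astype(bool)).mean()
print(f"carry-probe accuracy on random features: {acc:.2f}")
```

With real activations, the same readout run layer by layer shows where (and whether) the network linearly encodes the carry, which is what makes probes useful as a stage-specific diagnostic.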
The third bottleneck was 'conditional recomposition': the model struggled to correctly combine information from different parts of the problem, and high-conditioned tail data proved most effective for repair. Finally, the remaining errors were overwhelmingly concentrated in the tens digit. A late-stage, sign-aware intervention targeting this tens-residual stage raised exact-match accuracy on the hardest 'thousands-carry' test suite from 0.664 to 0.822 across 10 random seeds. This staged framework (layout, carry semantics, conditional recomposition, and tens residual) provides a blueprint for systematically debugging and improving neural network reasoning.
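The kind of per-digit diagnostic this staged framework implies can be sketched as follows. The toy `buggy_add` below drops the carry out of the units column to mimic a tens-residual failure mode; both helper names and the bug itself are illustrative assumptions, not the study's actual network or error source.

```python
# Hedged sketch of a per-digit error breakdown: run an adder over all
# 3-digit pairs and count which digit position each error lands in. The toy
# 'model' drops the units-column carry, so errors cluster in the tens digit.
from collections import Counter

def buggy_add(a, b):
    """Column addition that discards the carry from the units column."""
    u = a % 10 + b % 10
    t = a // 10 % 10 + b // 10 % 10   # units carry dropped here
    h = a // 100 + b // 100 + t // 10
    return h * 100 + (t % 10) * 10 + u % 10

def digit_errors(pred, true):
    """Digit positions (0 = units) where pred and true disagree."""
    pos, errs = 0, set()
    while pred or true:
        if pred % 10 != true % 10:
            errs.add(pos)
        pred, true, pos = pred // 10, true // 10, pos + 1
    return errs

err_by_pos = Counter()
for a in range(100, 1000):
    for b in range(100, 1000):
        for p in digit_errors(buggy_add(a, b), a + b):
            err_by_pos[p] += 1

print(err_by_pos)  # the tens position (1) dominates the error counts
```

Under this toy bug the tens column accumulates an order of magnitude more errors than any other position, which is the kind of concentration a tens-residual stage would surface and a targeted late-stage intervention would attack.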
- Identifies four failure stages: layout barriers, carry-flag semantics, conditional recomposition, and tens-residual errors.
- Targeted 'tens repair' intervention raised accuracy on hardest test suite from 66.4% to 82.2%.
- Shows that single benchmark scores conflate different types of failures, advocating for more granular diagnostics.
Why It Matters
Provides a framework for systematically debugging AI reasoning failures, moving beyond opaque benchmark scores to target specific model weaknesses.