Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
New research shows memory tokens are essential for adaptive recursive reasoning in transformers.
Grigory Sapunov's paper, 'Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning,' investigates whether learned memory tokens are necessary for a single-block Universal Transformer (UT) with Adaptive Computation Time (ACT) on the Sudoku-Extreme benchmark. The study finds that they are empirically necessary: across all tested configurations, no setup without memory tokens achieves non-trivial performance. The memory-token count exhibits a sharp lower threshold (T=0 always fails, T=4 is borderline, T=8 reliably succeeds for 81-cell puzzles), a stable plateau (T=8-32, 57.4% +/- 0.7% exact-match), and a collapse from attention dilution at T=64.
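To make the setup concrete, here is a minimal sketch of a single-block UT with T learned memory tokens prepended to the 81-cell puzzle sequence. This is an illustration under stated assumptions, not the paper's implementation: the class and parameter names (`UTWithMemory`, `n_mem`, `depth`) and all hyperparameters are invented for the example; only the idea of learned tokens sharing attention with the cells comes from the paper.

```python
import torch
import torch.nn as nn

class UTWithMemory(nn.Module):
    """Sketch: one shared transformer block applied recursively over
    an 81-cell Sudoku sequence plus T learned memory tokens."""

    def __init__(self, d_model=256, n_heads=8, n_mem=8, vocab=10, seq_len=81):
        super().__init__()
        self.cell_emb = nn.Embedding(vocab, d_model)  # digits 0-9, 0 = blank cell
        self.pos_emb = nn.Parameter(torch.randn(seq_len, d_model) * 0.02)
        # T learned memory tokens: extra state the model can write to
        # and read from across recursion steps.
        self.mem = nn.Parameter(torch.randn(n_mem, d_model) * 0.02)
        self.block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )  # a single block whose weights are shared across depth
        self.head = nn.Linear(d_model, vocab)

    def forward(self, cells, depth=16):
        # cells: (batch, 81) integer grid, 0 for empty cells
        h = self.cell_emb(cells) + self.pos_emb      # (B, 81, d)
        m = self.mem.expand(cells.size(0), -1, -1)   # (B, T, d)
        x = torch.cat([m, h], dim=1)                 # memory tokens prepended
        for _ in range(depth):                       # recursive application
            x = self.block(x)
        return self.head(x[:, m.size(1):])           # logits for the 81 cells
```

With T=0 the model has only the 81 cell states to carry intermediate deductions, which the paper finds is never enough; T=8-32 adds scratch space without crowding the cells out of attention.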
During experimentation, Sapunov identifies a router initialization trap that causes >70% of training runs to fail. Both the default zero-bias initialization (p ~ 0.5) and Graves' recommended positive bias (p ~ 0.73) cause tokens to halt after ~2 steps at initialization, settling into a shallow equilibrium (halting around step 5-7) that the model cannot escape. Inverting the bias to -3 ('deep start,' p ~ 0.05) eliminates this failure mode. With reliable training established, the study shows that ACT is more consistent than fixed-depth processing (56.9% +/- 0.7% vs 53.4% +/- 9.3% across 3 seeds), and that ACT with lambda warmup matches that accuracy (57.0% +/- 1.1%) while using 34% fewer ponder steps. Additionally, across recursive depth, attention heads specialize into memory readers, constraint propagators, and integrators.
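The trap is easy to see from Graves' ACT halting rule: computation stops once the cumulative per-token halt probability exceeds 1 - eps, so the halting head's initial bias fixes the depth the model starts with. Below is a minimal sketch; `ACTRouter`, `halting_step`, and the warmup schedule are illustrative assumptions, and only the -3 bias and the quoted probabilities come from the paper.

```python
import torch
import torch.nn as nn

class ACTRouter(nn.Module):
    """Per-token halting head for ACT (sketch)."""

    def __init__(self, d_model=256, halt_bias=-3.0):
        super().__init__()
        self.proj = nn.Linear(d_model, 1)
        nn.init.zeros_(self.proj.weight)
        # 'Deep start': sigmoid(-3) ~ 0.047, so tokens almost never halt
        # at initialization and gradients flow through deep recursion.
        nn.init.constant_(self.proj.bias, halt_bias)

    def forward(self, x):
        return torch.sigmoid(self.proj(x)).squeeze(-1)  # (B, seq) halt probs

def halting_step(p, eps=0.01):
    """Steps until cumulative halt probability reaches 1 - eps (Graves' rule)."""
    cum, n = 0.0, 0
    while cum < 1.0 - eps:
        cum += p
        n += 1
    return n

halting_step(0.5)     # 2  -> default zero-bias init halts almost immediately
halting_step(0.731)   # 2  -> Graves' recommended positive bias, same trap
halting_step(0.0474)  # 21 -> 'deep start' leaves room for deep recursion

def ponder_lambda(step, warmup_steps=10_000, lam_max=1e-2):
    """Assumed linear warmup of the ponder-cost weight: computation is
    nearly free early in training, then extra steps are penalized.
    Used as: loss = task_loss + ponder_lambda(step) * ponder_cost."""
    return lam_max * min(1.0, step / warmup_steps)
```

The `halting_step` arithmetic shows why both standard initializations collapse to the same shallow regime: at p ~ 0.5 or p ~ 0.73 the cumulative probability crosses the threshold in two steps, while p ~ 0.05 permits roughly twenty recursion steps before halting.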
- Memory tokens are empirically necessary for Universal Transformers to perform complex reasoning on the Sudoku-Extreme benchmark.
- Optimal performance requires 8-32 memory tokens (57.4% exact-match), but 64 tokens cause attention dilution and performance collapse.
- A router initialization trap causes >70% of training runs to fail, solved by using a negative bias of -3 for a 'deep start.'
Why It Matters
This work offers practical guidance for designing efficient adaptive transformers: memory tokens supply state that recursive depth alone cannot, correct router initialization prevents the most common training failures, and ponder-cost warmup cuts computation without sacrificing accuracy.