Fully Looped Transformer stabilizes 12-loop training, boosts performance by up to 13.2%
No more collapse: two parameter-free fixes unlock iterative reuse of transformer blocks.
A team of researchers from Jilin University and other institutions has introduced the Fully Looped Transformer (FLT), a parameter-free modification to Looped Transformers that addresses two key sources of training instability: gradient oscillation and residual explosion. Looped Transformers reuse the same blocks iteratively, offering a way to trade compute for performance without increasing model size. In practice, however, their training collapses once the loop count exceeds a few iterations. FLT makes two changes. Its Fully Looped Architecture distributes inter-loop signals across all layers, mitigating residual explosion, while Attention Injection reuses the existing attention block to suppress gradient oscillation. Experiments show that FLT trains stably at up to 12 loop iterations, whereas baseline looped models collapse. Even at lower loop counts, FLT improves average downstream performance by up to 13.2%.
By decoupling parameter count from compute, FLT allows practitioners to adjust the number of loop iterations at inference time, providing a natural mechanism for balancing performance against test-time compute budgets. This makes FLT a practical alternative to scaling up model size, especially in resource-constrained settings. The modifications are simple and require no additional training overhead, relying only on architectural changes to the existing transformer. The paper provides evidence that FLT not only stabilizes training but also enhances generalization across multiple benchmarks, suggesting that iterative computation can be a reliable path to better performance without the cost of larger models.
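To make the idea of decoupling parameter count from compute concrete, here is a minimal sketch of generic block reuse. It is an illustration under stated assumptions, not the paper's implementation: the toy block is a single residual layer rather than a real transformer block, it includes neither the Fully Looped Architecture nor Attention Injection, and the names `make_block`, `apply_block`, `looped_forward`, and `count_params` are hypothetical.

```python
import math
import random


def make_block(dim, seed=0):
    """Build one toy block: a single dim x dim weight matrix.

    Hypothetical stand-in for a transformer block; not FLT itself.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]


def apply_block(weights, x):
    """Residual update x + tanh(W x) for one toy block."""
    out = []
    for row, xi in zip(weights, x):
        pre = sum(w * v for w, v in zip(row, x))
        out.append(xi + math.tanh(pre))
    return out


def looped_forward(blocks, x, n_loops):
    """Run the same stack of blocks n_loops times, reusing the weights.

    Returns the final state and the number of block applications,
    which serves here as a rough proxy for compute.
    """
    applications = 0
    for _ in range(n_loops):
        for weights in blocks:
            x = apply_block(weights, x)
            applications += 1
    return x, applications


def count_params(blocks):
    """Count weights in the stack; independent of the loop count."""
    return sum(len(row) for weights in blocks for row in weights)


blocks = [make_block(dim=8, seed=s) for s in range(2)]
x0 = [0.1] * 8

# Same weights, different loop counts chosen at inference time.
for n_loops in (1, 4, 12):
    _, applications = looped_forward(blocks, x0, n_loops)
    print(n_loops, count_params(blocks), applications)
```

Running the sketch prints `1 128 2`, `4 128 8`, and `12 128 24`: the parameter count stays at 128 while the number of block applications grows linearly with the loop count. That linear growth is the knob a practitioner would turn to fit a test-time compute budget.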
- Fully Looped Transformer (FLT) introduces two parameter-free modifications: Fully Looped Architecture (distributes signals across layers) and Attention Injection (reuses attention to reduce gradient oscillation).
- FLT trains stably at up to 12 loop iterations, while baseline Looped Transformers collapse; even at lower loop counts, average downstream performance improves by up to 13.2%.
- FLT enables test-time compute scaling by varying loop iterations at inference, allowing performance-compute trade-offs without increasing model size or context length.
Why It Matters
FLT offers a simple, parameter-free way to improve model performance through iterative compute: the same model can be run with more or fewer loops to match the available inference budget, with no increase in model size.