Transformers with Selective Access to Early Representations [R]
A new architecture uses per-token gates to selectively reuse early features, boosting performance with negligible throughput overhead.
Researchers introduced SATFormer, a Transformer architecture that improves the efficiency-performance trade-off by enabling selective access to early representations. Unlike prior methods such as DenseFormer, MUDDFormer, and HyperConnections, which add dense or dynamic cross-layer pathways at significant throughput and memory cost, SATFormer keeps only the cheap first-layer value pathway from value residual learning. It replaces static layer-wise mixing with a per-token, per-head, context-dependent gate that learns when and where each head should re-access the first-layer value stream. Across model sizes from 130M to 1.3B parameters, SATFormer consistently improves validation loss over both vanilla Transformers and ResFormer baselines.
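To make the mechanism concrete, here is a minimal PyTorch sketch of a per-token, per-head gate over the first-layer value stream. The module name `GatedValueResidual`, the sigmoid gate, and the interpolation form are illustrative assumptions based on the description above, not the paper's exact formulation:

```python
import torch
import torch.nn as nn


class GatedValueResidual(nn.Module):
    """Per-token, per-head gate over the first-layer value stream.

    A minimal sketch: the sigmoid gate and the interpolation form are
    assumptions for illustration, not the paper's exact formulation.
    """

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads = n_heads
        # Context-dependent gate: one scalar per token and per head,
        # computed from the current layer's hidden state.
        self.gate_proj = nn.Linear(d_model, n_heads)

    def forward(self, v: torch.Tensor, v1: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # v, v1: (batch, n_heads, seq_len, d_head) -- current-layer and
        #        first-layer value streams
        # x:     (batch, seq_len, d_model)         -- current hidden state
        g = torch.sigmoid(self.gate_proj(x))        # (batch, seq_len, n_heads)
        g = g.permute(0, 2, 1).unsqueeze(-1)        # (batch, n_heads, seq_len, 1)
        # Each head interpolates between its own values and the
        # first-layer values, token by token.
        return (1.0 - g) * v + g * v1
```

In a full block, the first-layer values would presumably be cached once and broadcast to all later layers, so the extra cost per layer is a single linear projection, consistent with the near-vanilla throughput reported below.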
On retrieval-intensive benchmarks, SATFormer achieved the best average score among the evaluated architectures, narrowly surpassing MUDDFormer and improving over ResFormer by about 1.5 points. Crucially, SATFormer runs at throughput close to that of standard Transformers and ResFormer, while HyperConnections and MUDDFormer lag by roughly 1.75–1.82x. Mechanistic analysis reveals that the gate's access pattern is sparse, depth-dependent, head-specific, and stronger for particular tokens, suggesting the architecture treats early-representation reuse as a retrieval and control problem rather than a pure connectivity or routing problem. The paper and code are available on arXiv and GitHub.
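The sparsity claim can be checked with a simple probe over collected gate activations; the function `gate_open_rates` and the 0.5 threshold below are hypothetical choices for illustration, not the paper's analysis code:

```python
import torch


def gate_open_rates(gates_per_layer: list[torch.Tensor], threshold: float = 0.5) -> torch.Tensor:
    """Summarize how often each head 'opens' its gate at each layer.

    gates_per_layer: one tensor of shape (batch, seq_len, n_heads) per layer.
    Sparse, depth-dependent, head-specific access would show low but
    uneven open rates across layers and heads.
    """
    rates = []
    for g in gates_per_layer:
        # Fraction of (token, head) positions whose gate exceeds the threshold.
        open_rate = (g > threshold).float().mean(dim=(0, 1))  # (n_heads,)
        rates.append(open_rate)
    return torch.stack(rates)  # (n_layers, n_heads)
```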
- SATFormer uses per-token, per-head, context-dependent gates to selectively reuse the first-layer value stream.
- Outperforms standard Transformers and ResFormer on validation loss across 130M–1.3B parameter models.
- Runs roughly 1.75–1.82x faster than HyperConnections and MUDDFormer while staying close to standard Transformer throughput.
Why It Matters
Enables larger, more efficient Transformers for retrieval-heavy tasks without the throughput penalty of dense cross-layer connections.