Adaptive Computation Depth via Learned Token Routing in Transformers
Each token decides its own depth, saving up to 23% of compute with under 0.5% quality loss.
A new arXiv preprint introduces Token-Selective Attention (TSA), a method that lets each token dynamically decide how many transformer layers it passes through. Standard transformers apply the same depth to every token, wasting compute on simple inputs. TSA inserts a lightweight two-layer MLP gate between blocks that produces a continuous halting probability, making the routing end-to-end differentiable. The gate adds only 1.7% extra parameters and requires no changes to the base architecture.
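The preprint's exact gating equations aren't reproduced here, but the mechanism can be sketched in a few lines of PyTorch. In this minimal sketch, `HaltingGate`, `GatedBlock`, and the soft interpolation rule (halted tokens copy their state forward, active tokens receive the full block output) are illustrative assumptions, not the paper's verbatim implementation:

```python
import torch
import torch.nn as nn

class HaltingGate(nn.Module):
    """Lightweight two-layer MLP mapping each token's hidden state
    to a continuous halting probability in (0, 1)."""
    def __init__(self, d_model: int, d_hidden: int = 64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> (batch, seq_len, 1)
        return torch.sigmoid(self.mlp(x))

class GatedBlock(nn.Module):
    """Wraps an existing transformer block with a per-token halting gate.
    The soft mix keeps routing end-to-end differentiable: gradients flow
    through both the copy path and the compute path."""
    def __init__(self, block: nn.Module, d_model: int):
        super().__init__()
        self.block = block
        self.gate = HaltingGate(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_halt = self.gate(x)  # (batch, seq_len, 1), broadcast over d_model
        # Halted tokens (p_halt -> 1) pass through unchanged; active tokens
        # (p_halt -> 0) receive the full block computation.
        return p_halt * x + (1.0 - p_halt) * self.block(x)
```

Because `GatedBlock` only wraps each block, the base architecture stays untouched, consistent with the paper's claim; the gate's two small linear layers account for the modest parameter overhead.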
Remarkably, TSA learns difficulty-proportional routing even with no depth regularisation (λ = 0), automatically skipping 20% of token-layer operations. On character-level benchmarks (Tiny-Shakespeare, enwik8), TSA saved 14–23% of token-layer operations with less than 0.5% quality loss, and when matched for efficiency it achieved 0.7% lower validation loss than early-exit methods. Because the learned routing transfers directly to sparse inference, it promises real wall-clock speedups for production transformer models.
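How the soft gate becomes hard skips at inference time is not detailed above, so the following is a hedged sketch of one straightforward thresholding scheme, reusing the hypothetical `GatedBlock` from earlier; the 0.5 threshold and the full-sequence simplification are assumptions:

```python
import torch

@torch.no_grad()
def hard_routed_forward(gated_block, x: torch.Tensor, threshold: float = 0.5):
    """Inference-time hard routing: tokens whose halting probability meets
    `threshold` copy their state forward; the rest take the block's update.
    For simplicity the block still runs on the full sequence here; real
    wall-clock savings require gathering only the active tokens, which is
    straightforward for MLP sub-layers but needs care for attention (keys
    and values from halted tokens may still be required)."""
    p_halt = gated_block.gate(x).squeeze(-1)   # (batch, seq_len)
    halted = p_halt >= threshold               # bool mask per token
    y = gated_block.block(x)                   # full compute (simplified)
    return torch.where(halted.unsqueeze(-1), x, y)
```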
- Adds a 1.7% parameter overhead with no changes to the base transformer architecture
- Learns to skip 20% of token-layer operations even without any depth regularisation (a plausible form of this penalty is sketched after this list)
- Saves 14–23% compute on character-level language modeling with <0.5% quality loss
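The depth regularisation term λ mentioned above is not spelled out in this summary; a common instantiation, shown below purely as an assumption, penalises each token's expected compute so that λ = 0 recovers the unregularised setting in which routing still emerges:

```python
import torch

def depth_regularised_loss(task_loss: torch.Tensor,
                           halting_probs: list[torch.Tensor],
                           lam: float = 0.0) -> torch.Tensor:
    """One plausible depth penalty: the mean probability of *not* halting,
    i.e. the expected fraction of token-layer operations actually executed.
    halting_probs holds one (batch, seq_len, 1) tensor per gated block."""
    expected_compute = torch.stack(
        [(1.0 - p).mean() for p in halting_probs]
    ).mean()
    return task_loss + lam * expected_compute
```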
Why It Matters
Makes large transformer inference cheaper and faster by letting each token self-select its computational depth.