Research & Papers

Adaptive Computation Depth via Learned Token Routing in Transformers

Each token decides its own depth, saving up to 23% of compute with under 0.5% quality loss.

Deep Dive

A new arXiv preprint introduces Token-Selective Attention (TSA), a method that lets each token dynamically decide how many transformer layers to pass through. Standard transformers apply the same depth to every token, wasting compute on simple inputs. TSA adds a lightweight two-layer MLP gate between blocks that produces a continuous halting probability, making the routing end-to-end differentiable at a cost of only 1.7% additional parameters and no changes to the base architecture.
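The paper's exact gate is not reproduced here, but the mechanism described above can be sketched in plain Python: a two-layer MLP maps a token's hidden state to a halting probability, which softly blends the block's output with the residual stream so the whole path stays differentiable. All sizes, the ReLU/sigmoid choices, and the blending rule are assumptions for illustration, not the authors' specification.

```python
import math
import random

random.seed(0)

D, H = 8, 16  # hypothetical hidden size and gate width

# Hypothetical two-layer MLP gate: D -> H -> 1, sigmoid output.
W1 = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(H)]
W2 = [random.gauss(0, 0.1) for _ in range(H)]

def halting_prob(x):
    """Continuous halting probability in (0, 1) for one token vector x."""
    h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]  # ReLU
    z = sum(w * hi for w, hi in zip(W2, h))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid

def gated_block(x, block):
    """Soft routing: blend the block output with the residual stream.

    p near 1: the token takes the layer; p near 0: it effectively
    skips it. The blend keeps training end-to-end differentiable.
    """
    p = halting_prob(x)
    y = block(x)
    return [p * yi + (1.0 - p) * xi for xi, yi in zip(x, y)], p

# Toy "transformer block": just scales the token vector.
x = [random.gauss(0, 1) for _ in range(D)]
out, p = gated_block(x, lambda v: [0.5 * vi for vi in v])
```

Because the gate output multiplies the block output rather than making a hard yes/no choice, gradients flow through both branches during training; the hard skipping only happens at inference.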

Remarkably, TSA learns difficulty-proportional routing even without any depth regularisation (λ=0), skipping a full 20% of token-layer operations automatically. On character-level tasks (Tiny-Shakespeare, enwik8), TSA saved 14–23% of token-layer operations with less than 0.5% quality loss. When matched for efficiency, TSA achieved 0.7% lower validation loss than early-exit methods. The learned routing transfers directly to sparse inference, promising real wall-clock speedups for production transformer models.
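The transfer to sparse inference presumably means thresholding the learned continuous gate into a hard skip decision at serving time; the summary does not give the rule, so the cutoff `tau` and the stand-in gate below are assumptions. The point of the sketch is where the savings come from: a token below threshold bypasses the layer entirely, spending zero compute on it.

```python
import random

random.seed(0)

def gate(x):
    """Stand-in for TSA's trained halting gate (a real model would
    evaluate the learned two-layer MLP on the token state); here a
    random score, just to exercise the control flow."""
    return random.random()

def sparse_forward(x, blocks, tau=0.5):
    """Hard routing at inference: threshold the halting probability
    learned during training. tau is a hypothetical cutoff.

    Returns the final token state and how many layers were skipped.
    """
    skipped = 0
    for block in blocks:
        if gate(x) < tau:
            skipped += 1
            continue  # token bypasses this layer: no compute spent
        x = block(x)
    return x, skipped

# 12 toy "layers" that each shift the token vector.
blocks = [lambda v: [vi + 1.0 for vi in v] for _ in range(12)]
out, skipped = sparse_forward([0.0, 0.0], blocks)
```

Per-token skipping like this only yields wall-clock gains if the kernel or batching layer can exploit the sparsity, which is presumably what the authors mean by the routing transferring "directly" to sparse inference.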

Key Points
  • Adds a 1.7% parameter overhead with no changes to base transformer architecture
  • Learns to skip 20% of token-layer operations even without any depth regularisation
  • Saves 14–23% compute on character-level language modeling with <0.5% quality loss
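The depth regularisation mentioned above is not spelled out in this summary; a common form for such objectives, assumed here for illustration, penalises each token's expected depth (the mean halting probability across layers) scaled by λ, so λ = 0 recovers the plain task loss, the setting under which TSA reportedly still learns to skip about 20% of operations.

```python
def depth_regularised_loss(task_loss, gate_probs, lam):
    """Hypothetical training objective: task loss plus lam times the
    expected depth, taken as the mean per-layer halting probability
    for a token. With lam = 0 the penalty vanishes entirely."""
    expected_depth = sum(gate_probs) / len(gate_probs)
    return task_loss + lam * expected_depth

# lam = 0: only the task loss remains.
loss_unreg = depth_regularised_loss(2.5, [0.9, 0.1, 0.4], 0.0)
# lam > 0: deeper routing is penalised.
loss_reg = depth_regularised_loss(2.5, [0.9, 0.1, 0.4], 1.0)
```

Under this form, the reported λ = 0 result is notable precisely because nothing in the objective pushes tokens toward shallow routing; the skipping emerges from the task loss alone.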

Why It Matters

Makes large transformer inference cheaper and faster by letting each token self-select its computational depth.