Researchers unify conflicting scaling laws for long-context attention
A single framework resolves the log n versus (log n)^2 debate
Hayase and Karakida propose a unified framework for the critical scaling of inverse temperature in self-attention. They introduce a gap-counting function, N_n, which determines both the desirable scale for logit rescaling and the critical scale for softmax concentration: below this scale the top competitors remain unseparated, while above it attention entropy collapses. The framework unifies prior scaling laws (including (log n)^{1/2}, log n, and (log n)^2) as different gap-counting functions, and provides a direct diagnostic tool for attention-score families ranging from theoretical models to practical transformers.
- Unifies three conflicting scaling laws, (log n)^{1/2}, log n, and (log n)^2, in one framework built on the gap-counting function N_n.
- Identifies a critical scale for the inverse temperature that separates diffuse attention (top competitors unseparated) from concentrated attention (entropy collapse).
- Provides a practical diagnostic for choosing the logit rescaling of a given attention-score family, from theoretical models to practical transformers.
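The two regimes described above can be seen in a small numerical experiment. The sketch below is not the authors' gap-counting diagnostic. It only measures the entropy of softmax(beta * scores) as the inverse temperature beta grows, under the assumption (ours, for illustration) that the attention scores are i.i.d. standard Gaussian. The function name `attention_entropy` and the chosen values of `n` and `betas` are hypothetical.

```python
import numpy as np

def attention_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    z = beta * scores
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    p /= p.sum()
    nz = p[p > 0]                # drop underflowed entries; 0 * log 0 is taken as 0
    return float(-(nz * np.log(nz)).sum())

rng = np.random.default_rng(0)
n = 4096                          # context length (illustrative)
scores = rng.standard_normal(n)   # assumption: i.i.d. Gaussian attention scores

betas = [0.0, 1.0, 4.0, 16.0, 64.0]
entropies = [attention_entropy(scores, b) for b in betas]

for b, h in zip(betas, entropies):
    print(f"beta={b:6.1f}  entropy={h:7.3f}  (uniform = {np.log(n):.3f})")
```

At beta = 0 the attention is uniform and the entropy equals log n. As beta increases the entropy can only decrease, because its derivative with respect to beta is minus beta times the variance of the scores under the softmax distribution. Where the collapse sets in depends on how closely the top scores are spaced, which is the quantity the gap-counting function is meant to capture.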
Why It Matters
Replaces guesswork in logit scaling for long-context transformers with a principled rule, which could help keep attention stable at 100K+ tokens.