Research & Papers

Researchers unify conflicting scaling laws for long-context attention

A single framework resolves the log n versus (log n)^2 debate

Deep Dive

Hayase and Karakida propose a unified framework for the critical scaling of inverse temperature in self-attention. They introduce a gap-counting function, N_n, which determines both the desirable scale for logit rescaling and the critical scale for softmax concentration: below this scale the top competitors remain unseparated; above it, attention entropy collapses. The framework recovers prior scaling laws, including (log n)^{1/2}, log n, and (log n)^2, as the results of different gap-counting functions, and it provides a direct diagnostic tool for attention-score families ranging from theoretical models to practical transformers.
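To make the two regimes concrete, here is a minimal sketch that computes the entropy of softmax(beta * scores) for a toy score family. The i.i.d. Gaussian scores, the sequence length, and the beta values are assumptions chosen for illustration. The function does not implement the paper's N_n; it only shows the diffuse-to-concentrated transition as the inverse temperature grows.

```python
import math
import random


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    top = max(scores)
    # Subtract the max before exponentiating for numerical stability.
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0.0:  # skip entries that underflowed to zero
            entropy -= p * math.log(p)
    return entropy


# Assumed toy setup: n i.i.d. standard Gaussian attention scores.
rng = random.Random(0)
n = 4096
scores = [rng.gauss(0.0, 1.0) for _ in range(n)]

for beta in (0.0, 1.0, 4.0, 16.0, 64.0):
    h = softmax_entropy(scores, beta)
    print(f"beta={beta:5.1f}  entropy={h:.3f}  (uniform = {math.log(n):.3f})")
```

At beta = 0 the attention is uniform and the entropy equals log n. As beta increases the entropy can only decrease, approaching zero once a single score dominates.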

Key Points
  • Unifies three conflicting scaling laws, (log n)^{1/2}, log n, and (log n)^2, in one framework built on the gap-counting function N_n.
  • Proves a critical threshold for inverse temperature that separates concentrated vs. diffuse attention.
  • Provides a practical diagnostic for choosing the logit rescaling of a given attention-score family, rather than tuning it by hand.
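One rough way to probe such a threshold empirically is to sweep the inverse temperature and record where the attention entropy first drops below a chosen fraction of log n. The sketch below does this for toy Gaussian scores. The helper name entropy_crossing, the 0.5 fraction, and the beta grid are hypothetical choices for illustration; this is a brute-force stand-in, not the paper's gap-counting diagnostic.

```python
import math
import random


def softmax_entropy(scores, beta):
    """Shannon entropy (in nats) of softmax(beta * scores)."""
    top = max(scores)
    weights = [math.exp(beta * (s - top)) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_crossing(scores, betas, fraction=0.5):
    """Return the first beta whose entropy is below fraction * log(n), or None.

    Hypothetical helper: `betas` is assumed to be sorted in increasing order.
    """
    target = fraction * math.log(len(scores))
    for beta in betas:
        if softmax_entropy(scores, beta) < target:
            return beta
    return None


# Assumed toy setup: Gaussian scores at several sequence lengths.
rng = random.Random(0)
betas = [0.25 * k for k in range(1, 81)]  # grid from 0.25 to 20.0
for n in (256, 1024, 4096):
    scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
    print(f"n={n:5d}  first beta below half of log n: {entropy_crossing(scores, betas)}")
```

How the crossing point moves as n grows depends on the score distribution, which is the dependence the framework is said to capture through N_n.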

Why It Matters

Reduces the guesswork in scaling logits for long-context transformers, with the aim of keeping attention stable at 100K+ tokens.