Research & Papers

New scaling rules for Gated Delta Networks enable stable LR transfer

Researchers unlock feature learning in efficient Gated Delta Networks, enabling zero-shot hyperparameter transfer across widths.

Deep Dive

Training large language models demands enormous compute, driving interest both in sub-quadratic architectures such as Gated Delta Networks (GDNs) and in principled hyperparameter tuning. The Maximal Update Parametrization (μP) enables zero-shot transfer of optimal hyperparameters across widths for standard Transformers, but extending it to linear recurrent models with structured state transitions and gating mechanisms remained an open problem. In a new paper, UCLA researchers Yifeng Liu and Quanquan Gu rigorously propagate coordinate-size estimates through the forward pass, gating layers, and recurrent state dynamics of GDNs, and from that analysis derive scaling rules that preserve feature learning as model width increases.
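For readers unfamiliar with the recurrence being analyzed, here is a minimal sketch of a single gated delta rule step. The update form S_t = α_t · S_{t-1}(I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ is the commonly stated version of the rule and is an assumption here, since this summary does not spell it out. The function name `gated_delta_step` and its arguments are hypothetical, and real GDN layers add projections, normalization, and chunked parallel kernels that are omitted.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (assumed form, see lead-in).

    S     : state matrix of shape d_v x d_k, as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : scalar decay gate, typically in (0, 1]
    beta  : scalar write strength, typically in [0, 1]
    """
    d_v, d_k = len(S), len(k)
    new_S = []
    for i in range(d_v):
        # What the old state currently returns for key k: (S k)_i
        pred = sum(S[i][j] * k[j] for j in range(d_k))
        # S_t = alpha * (S - beta * (S k) k^T) + beta * v k^T
        row = [
            alpha * (S[i][j] - beta * pred * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ]
        new_S.append(row)
    # Read out with the query: o_t = S_t q_t
    out = [sum(new_S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_S, out


# Usage: write a value under a unit-norm key, then overwrite it.
S0 = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]
S1, o1 = gated_delta_step(S0, key, key, [1.0, 2.0], alpha=1.0, beta=1.0)
S2, o2 = gated_delta_step(S1, key, key, [3.0, 4.0], alpha=1.0, beta=1.0)
```

With `beta=1.0` and a unit-norm key, the second write replaces the value stored under that key rather than adding to it, which is the "delta" behavior. The gate `alpha` uniformly decays the whole state.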

Language-model pre-training experiments support the theory: under both AdamW and SGD, the derived parametrization yields stable learning-rate transfer from small to large widths, whereas standard parametrization fails to transfer. In practice, this means hyperparameters can be tuned on a small proxy model and reused on a much wider GDN without retuning. The work closes a gap between μP theory and efficient architectures, potentially lowering the cost of training large GDNs.
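To make the proxy-to-target workflow concrete, the sketch below shows how a learning rate tuned at a small width might be rescaled for a wider model. It uses the commonly cited rule from standard μP with Adam-type optimizers, where the learning rate of hidden weight matrices shrinks in proportion to width while vector-like parameters keep the base rate. This is an illustrative assumption only: the GDN-specific rules derived in the paper are not given in this summary and may differ, and the function name and dictionary keys are hypothetical.

```python
def mup_adam_lrs(base_lr, base_width, width):
    """Rescale a learning rate tuned at base_width for a model of a new width.

    Assumes the generic muP rule for Adam-type optimizers:
      - hidden (matrix-like) weights: lr scales as base_width / width
      - vector-like parameters (biases, gains): lr stays at base_lr
    This is NOT the paper's GDN-specific parametrization.
    """
    if base_width <= 0 or width <= 0:
        raise ValueError("widths must be positive")
    ratio = base_width / width
    return {
        "hidden_matrix": base_lr * ratio,
        "vector_like": base_lr,
    }


# Usage: tune at width 256, then reuse the result at width 4096.
proxy_lrs = mup_adam_lrs(base_lr=1e-2, base_width=256, width=256)
target_lrs = mup_adam_lrs(base_lr=1e-2, base_width=256, width=4096)
```

Here the hidden-matrix learning rate for the wide model is 16 times smaller than the tuned proxy value, with no further sweep at the target width.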

Key Points
  • Extends Maximal Update Parametrization (μP) to Gated Delta Networks, a sub-quadratic architecture with structured state transitions.
  • Derives scaling rules by analyzing forward pass, gating mechanisms, and recurrent dynamics to preserve feature learning across widths.
  • Validated on language-model pre-training: stable LR transfer under AdamW and SGD, while standard parametrization fails to transfer.

Why It Matters

Enables zero-shot hyperparameter transfer for efficient GDN architectures, reducing the tuning cost and trial-and-error involved in training large models.