New scaling rules for Gated Delta Networks enable stable LR transfer

arXiv cs.LG June 04, 2026

⚡ Researchers unlock feature learning in efficient Gated Delta Networks, enabling zero-shot hyperparameter transfer across widths.

Deep Dive

Training large language models demands enormous compute, driving interest in sub-quadratic architectures like Gated Delta Networks (GDNs) and in principled hyperparameter tuning. The Maximal Update Parametrization (μP) has shown zero-shot transfer of optimal hyperparameters across widths for standard Transformers, but its extension to linear models with structured state transitions and gating mechanisms was an open problem. In a new paper, UCLA researchers Yifeng Liu and Quanquan Gu rigorously propagate coordinate-size estimates through the forward pass, gating layers, and recurrent state dynamics of GDNs. They derive scaling rules that preserve feature learning as model width increases.
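For readers unfamiliar with the architecture, the recurrence being analyzed can be written compactly. The form below is the gated delta rule as it commonly appears in the Gated DeltaNet literature; this summary does not restate it, so the notation (state $S_t$, decay gate $\alpha_t$, write strength $\beta_t$, key, value, and query vectors $k_t, v_t, q_t$) should be read as an assumption rather than the paper's exact formulation.

```latex
S_t = \alpha_t \, S_{t-1}\left(I - \beta_t k_t k_t^{\top}\right) + \beta_t v_t k_t^{\top},
\qquad
o_t = S_t q_t,
\qquad
\alpha_t, \beta_t \in (0, 1)
```

The data-dependent factor $\alpha_t (I - \beta_t k_t k_t^{\top})$ is the structured state transition, and $\alpha_t$ and $\beta_t$ are the gates. A μP-style analysis has to track how the typical entry size of $S_t$ and $o_t$ behaves as the width grows, which is why the gates and the recurrence need their own treatment beyond the standard Transformer case.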

Experiments on language-model pre-training validate their theory: under both AdamW and SGD, the derived parametrization enables stable learning-rate transfer from small to large widths, while standard parametrization fails. This means practitioners can tune the learning rate on a narrow proxy model and reuse it on a much wider GDN instead of re-tuning at full scale. The work closes a gap between μP theory and efficient sub-quadratic architectures, which could lower the cost of training large GDNs.
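To make the idea of width transfer concrete, here is a minimal sketch of how a tuned learning rate is usually carried across widths under μP. It assumes the commonly cited μP rule for plain hidden weight matrices under Adam-type optimizers, where the learning rate scales as 1/width. The function name `transfer_hidden_lr` and the example widths are hypothetical, and the GDN-specific rules derived in the paper are not reproduced here.

```python
def transfer_hidden_lr(proxy_lr: float, proxy_width: int, target_width: int) -> float:
    """Rescale a hidden-matrix learning rate tuned at proxy_width to target_width.

    Assumption: the standard muP rule for Adam-type optimizers on plain hidden
    weight matrices (learning rate proportional to 1 / width). This is NOT the
    paper's GDN rule set, which the article does not spell out.
    """
    if proxy_width <= 0 or target_width <= 0:
        raise ValueError("widths must be positive")
    # Widening the layer by a factor m shrinks the per-matrix learning rate by m.
    return proxy_lr * proxy_width / target_width


if __name__ == "__main__":
    tuned_lr = 1e-2  # hypothetical value found by sweeping on the narrow proxy
    for width in (256, 1024, 4096):
        print(width, transfer_hidden_lr(tuned_lr, 256, width))
```

The point of the design is that the sweep happens once, on the cheap model. The parametrization then determines each layer's learning rate at the larger width, so no second sweep is needed.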

Key Points

Extends Maximal Update Parametrization (μP) to Gated Delta Networks, a sub-quadratic architecture with structured state transitions.
Derives scaling rules by analyzing forward pass, gating mechanisms, and recurrent dynamics to preserve feature learning across widths.
Validated on language-model pre-training: stable LR transfer under AdamW and SGD, while standard parametrization fails to transfer.

Why It Matters

Enables zero-shot hyperparameter transfer for efficient GDN architectures: a learning rate tuned on a narrow model carries over to wider ones, reducing the cost of trial-and-error tuning at scale.
