Research & Papers

Learning Rate Transfer in Normalized Transformers

New parameterization enables hyperparameter transfer across model width, depth, and token horizon

Deep Dive

Normalized transformers, introduced as nGPT in arXiv:2410.01131, offer impressive training speedups and eliminate the need for weight decay and learning rate warmup. However, the authors of the new work observe that, even though nGPT's hyperparameters explicitly scale with model size, its optimal learning rate does not transfer across model width or token horizon. Practitioners must therefore re-tune hyperparameters at every scale, undermining much of the efficiency gain.
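
For intuition, here is a minimal sketch of the normalization idea that defines nGPT: after every optimizer step, the weight vectors are projected back onto the unit hypersphere, which is what removes the need for weight decay and warmup. The sketch is in PyTorch; the function name and the restriction to Linear layers are illustrative assumptions, not the paper's code.

    import torch

    def renormalize_weights(model: torch.nn.Module, dim: int = 1, eps: float = 1e-8) -> None:
        # Project each row of every Linear weight back onto the unit hypersphere.
        # nGPT keeps the vectors making up its matrices at unit norm after every
        # optimizer step; this sketch only covers Linear layers for brevity.
        with torch.no_grad():
            for module in model.modules():
                if isinstance(module, torch.nn.Linear):
                    w = module.weight
                    w.div_(w.norm(dim=dim, keepdim=True).clamp_min(eps))

    # Where this sits in a training step (sketch):
    #   loss.backward()
    #   optimizer.step()
    #   renormalize_weights(model)   # restore unit-norm weights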

To address this, the team pairs numerical experiments with the alignment exponents of arXiv:2407.05872 and revisits the μP parameterization (arXiv:2011.14522). The result is νGPT, a reparameterization that exhibits learning rate transfer across width, depth, and token horizon, so a learning rate tuned on a small proxy model can be reused at larger scales. Empirical validation confirms that νGPT eliminates scale-specific tuning, making normalized transformers more practical at production scale.
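
As a point of reference, the μP recipe that νGPT builds on boils down to width-dependent per-layer learning rates: tune once on a narrow proxy model, then shrink the hidden-layer learning rate by the width ratio while vector-like parameters keep the base rate. The sketch below shows the plain μP rule, not νGPT's alignment-corrected exponents; the function name and the "embed" heuristic are illustrative assumptions.

    import torch

    def mup_param_groups(model: torch.nn.Module, base_lr: float,
                         base_width: int, width: int) -> list[dict]:
        # μP-style rule for Adam: matrix-like (hidden) parameters get their
        # learning rate scaled down by the width ratio relative to the small
        # proxy model where base_lr was tuned; vector-like parameters and
        # embeddings keep base_lr.
        hidden, other = [], []
        for name, p in model.named_parameters():
            if p.ndim >= 2 and "embed" not in name:   # crude heuristic for hidden weights
                hidden.append(p)
            else:
                other.append(p)
        return [
            {"params": hidden, "lr": base_lr * base_width / width},
            {"params": other, "lr": base_lr},
        ]

    # Usage sketch: tune base_lr once at width=256, then reuse it at width=4096.
    #   optimizer = torch.optim.Adam(
    #       mup_param_groups(model, base_lr=3e-3, base_width=256, width=4096))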

Key Points
  • nGPT's hyperparameters scale with model size, but its optimal learning rate does not transfer across model width or token horizon
  • νGPT uses alignment exponents to modify the μP hyperparameter transfer method
  • Empirical validation shows νGPT achieves learning rate transfer across width, depth, and token horizon
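
The transfer check itself is straightforward to describe: sweep learning rates at several widths (and depths or token budgets) and confirm that the loss-minimizing learning rate stays roughly constant. A minimal sketch, where build_model and train_and_eval are placeholders for whatever model-construction and training code you already have:

    def lr_transfer_sweep(build_model, train_and_eval,
                          widths=(256, 512, 1024),
                          lrs=(1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):
        # Sweep learning rates at several widths; under successful transfer the
        # best learning rate is (roughly) the same at every width.
        best = {}
        for width in widths:
            losses = {lr: train_and_eval(build_model(width), lr) for lr in lrs}
            best[width] = min(losses, key=losses.get)
        return best

    # A result like {256: 3e-3, 512: 3e-3, 1024: 3e-3} would indicate transfer.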

Why It Matters

Simplifies hyperparameter tuning when scaling normalized transformers, reducing training costs and making them easier to deploy at production scale.