A Self-Attentive Meta-Optimizer with Group-Adaptive Learning Rates and Weight Decay
A new optimizer dynamically adjusts learning rates and weight decay per parameter group, beating AdamW across five tasks.
In a recent arXiv preprint, researchers JiangBo Zhao and ZhaoXin Liu propose MetaAdamW, a new optimizer that replaces the one-size-fits-all approach of adaptive optimizers like AdamW. Instead of applying uniform hyperparameters across all parameter groups, MetaAdamW uses a self-attention mechanism (a lightweight Transformer encoder) to dynamically modulate per-group learning rates and weight decay. The attention module processes statistical features such as gradient norms, momentum norms, and correlations extracted from each parameter group. To train this module, the team introduces a meta-learning objective that combines gradient alignment, loss decrease, and generalization gap. They also extend homoscedastic uncertainty weighting (HUW) with task-specific priorities to guide automatic loss balancing.
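The paper's exact architecture is not reproduced in this summary, but the core mechanism can be sketched in PyTorch. In the snippet below, the feature set (gradient norm, momentum norm, gradient-momentum cosine similarity), the module sizes, and the multiplier range are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GroupHyperModulator(nn.Module):
    """Maps per-parameter-group statistics to learning-rate / weight-decay multipliers."""

    def __init__(self, num_features: int = 3, d_model: int = 32):
        super().__init__()
        self.embed = nn.Linear(num_features, d_model)  # project group statistics
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, dim_feedforward=64, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d_model, 2)  # per group: (lr scale, wd scale)

    def forward(self, group_stats: torch.Tensor) -> torch.Tensor:
        # group_stats: (num_groups, num_features); attention mixes information
        # across groups, so each multiplier reflects the state of the whole model.
        h = self.encoder(self.embed(group_stats).unsqueeze(0))
        return 2.0 * torch.sigmoid(self.head(h)).squeeze(0)  # multipliers in (0, 2)


def group_features(params, momenta):
    # Assumed statistics per group: gradient norm, momentum norm, and their cosine similarity.
    feats = []
    for p, m in zip(params, momenta):
        g = p.grad.detach().flatten()
        m = m.flatten()
        feats.append(torch.stack([g.norm(), m.norm(), F.cosine_similarity(g, m, dim=0)]))
    return torch.stack(feats)
```

In such a setup, the predicted multipliers would scale each group's base learning rate and weight decay inside an AdamW-style update, while the modulator itself is trained with the meta-objective described above.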
MetaAdamW was tested on five diverse tasks: time series forecasting (ETT), language modeling (WikiText-2), machine translation (Multi30k), image classification (CIFAR-10), and sentiment analysis (IMDB). It consistently beat the standard AdamW baseline in validation loss, accuracy, or perplexity. Depending on the task, MetaAdamW either reduced overall training time by up to 17.11% or improved performance by up to 11.08%, while introducing only moderate computational overhead. In some cases, it also mitigated insufficient convergence caused by premature early stopping. Ablation studies validated each component, including the feature variants, the grouping strategies, and the priority-injected uncertainty weighting.
- MetaAdamW uses a lightweight Transformer encoder to dynamically adjust learning rates and weight decay per parameter group based on gradient statistics.
- Outperforms AdamW on 5 tasks: up to 17.11% faster training or up to 11.08% better performance.
- Includes a novel priority-injected homoscedastic uncertainty weighting scheme for automated loss balancing (sketched below).
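The priority-injected uncertainty weighting can likewise be sketched under the standard homoscedastic-uncertainty formulation (one learnable log-variance per objective). The priority coefficients and the way they scale each term below are assumptions for illustration, not the paper's exact formula.

```python
import torch
import torch.nn as nn


class PriorityHUW(nn.Module):
    """Homoscedastic uncertainty weighting with fixed per-objective priorities (illustrative)."""

    def __init__(self, priorities):
        super().__init__()
        self.register_buffer("priorities", torch.as_tensor(priorities, dtype=torch.float32))
        # One learnable log-variance per meta-objective term.
        self.log_vars = nn.Parameter(torch.zeros(len(priorities)))

    def forward(self, losses):
        losses = torch.stack(losses)
        precision = torch.exp(-self.log_vars)  # 1 / sigma^2
        # Each term: priority * (loss / sigma^2 + log sigma^2)
        return (self.priorities * (precision * losses + self.log_vars)).sum()


if __name__ == "__main__":
    # Toy scalars standing in for gradient alignment, loss decrease, and generalization gap.
    align, decrease, gap = torch.rand(3)
    meta_loss = PriorityHUW(priorities=[1.0, 2.0, 1.0])([align, decrease, gap])
    print(meta_loss.item())
```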
Why It Matters
MetaAdamW promises smarter, faster training for large models, potentially reducing compute costs and time.