Research & Papers

MuonRec: Shifting the Optimizer Paradigm Beyond Adam in Scalable Generative Recommendation

Researchers challenge Adam's dominance with Muon optimizer, delivering 12.6% better recommendations in fewer steps.

Deep Dive

A research team led by Rong Shan has introduced MuonRec, a framework challenging the long-standing dominance of the Adam/AdamW optimizer in training large-scale recommender systems (RecSys). As RecSys models grow in size and complexity, the choice of optimizer becomes increasingly consequential, yet the field has largely defaulted to Adam without rigorously evaluating alternatives. MuonRec applies the Muon optimizer, which performs orthogonalized momentum updates for 2D weight matrices using Newton-Schulz iteration. This technique promotes more diverse update directions, leading to greater optimization efficiency. The team provides an open-source training recipe and validates it across both traditional sequential models and modern generative recommenders.
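To make the core idea concrete, here is a minimal NumPy sketch of an orthogonalized momentum step of the kind Muon performs: accumulate a momentum buffer of the gradient, then replace the update with an approximate orthogonalization of that buffer via Newton-Schulz iteration. This is an illustrative sketch, not the MuonRec code; the function names, hyperparameters, and the quintic iteration coefficients are assumptions drawn from public descriptions of the Muon optimizer.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix G via Newton-Schulz iteration,
    i.e. push its singular values toward 1 while keeping its singular vectors.
    The quintic coefficients below are the ones commonly cited for Muon
    (an assumption here, not taken from the MuonRec paper)."""
    a, b, c = 3.4445, -4.7750, 2.0315
    # Scale by the Frobenius norm so the spectral norm is <= 1,
    # which the iteration needs to stay stable.
    X = G / (np.linalg.norm(G) + eps)
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X

def muon_update(W, grad, momentum, lr=0.02, beta=0.95):
    """One Muon-style step for a 2D weight matrix: momentum accumulation,
    then an orthogonalized update direction (hypothetical hyperparameters)."""
    momentum = beta * momentum + grad
    update = newton_schulz_orthogonalize(momentum)
    return W - lr * update, momentum
```

Because the orthogonalized update has near-uniform singular values, no single direction in the weight matrix dominates the step, which is the "more diverse update directions" property the paper credits for Muon's efficiency. In practice this treatment is applied only to 2D weight matrices; embeddings, biases, and norm parameters are typically left to AdamW.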

The results are significant: MuonRec reduces the number of training steps required for convergence by an average of 32.4% while simultaneously improving final ranking quality. It achieves consistent relative gains in the key metric NDCG@10, averaging 12.6% across all experimental settings, with particularly strong improvements for generative recommendation models. MuonRec consistently outperforms strong Adam/AdamW baselines, suggesting Muon could become a new standard for RecSys training. The work highlights that optimizer innovation remains a high-leverage area for improving AI system efficiency and performance, especially as models scale. The availability of the code allows practitioners to test and integrate this approach into their own pipelines.

Key Points
  • Reduces converged training steps by an average of 32.4% compared to Adam/AdamW.
  • Improves ranking quality (NDCG@10) by an average of 12.6%, with pronounced gains in generative models.
  • Uses orthogonalized momentum updates via Newton-Schulz iteration for more diverse and efficient optimization of 2D weight matrices.

Why It Matters

Significantly cuts computational costs and time-to-train for large-scale recommender systems while delivering better user recommendations.