Muon Optimizer in DeepSpeed cuts LLM training time by 35%

PyTorch Blog June 03, 2026

⚡One momentum buffer instead of two: frontier labs are switching.

Deep Dive

DeepSpeed, Microsoft's deep learning optimization library, has added support for the Muon optimizer. Whereas AdamW keeps two state buffers per parameter (first- and second-moment estimates), Muon keeps a single momentum buffer, which shrinks optimizer state memory. Its key innovation is an orthogonalization step applied to the momentum matrix using Newton-Schulz iterations. The step pushes all singular values of the update toward the same magnitude, counteracting the high condition numbers typical of the 2D weight matrices in transformers. In practice, this amplifies rare but important update directions that would otherwise be overshadowed by a few dominant singular directions. On NanoGPT speedrunning benchmarks, Muon improved training speed by 35% over AdamW, and at the 1.5B-parameter scale it reached GPT-2 XL-level performance roughly 25% faster.
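To make the orthogonalization step concrete, here is a minimal NumPy sketch. It assumes the quintic Newton-Schulz iteration and coefficients from the publicly available Muon reference implementation, plus plain heavy-ball momentum. The function names, learning rate, and momentum value are illustrative, and none of this is taken from DeepSpeed's code.

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=5, eps=1e-7):
    """Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration.

    Coefficients are assumed from the public Muon reference implementation;
    they push singular values into a band around 1 rather than exactly to 1.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = np.asarray(G, dtype=np.float64)
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T  # iterate on the wide orientation so X @ X.T is the smaller Gram matrix
    X = X / (np.linalg.norm(X) + eps)  # Frobenius norm: all singular values become <= 1
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X  # X <- a*X + b*(X X^T) X + c*(X X^T)^2 X
    return X.T if transposed else X

def muon_step(W, grad, momentum_buf, lr=0.02, beta=0.95):
    """One illustrative Muon-style update; lr and beta are hypothetical defaults."""
    momentum_buf = beta * momentum_buf + grad  # the single state buffer Muon keeps
    update = newton_schulz_orthogonalize(momentum_buf)
    return W - lr * update, momentum_buf

# A badly conditioned "gradient": singular values span two orders of magnitude.
G = np.diag([10.0, 1.0, 0.1])
O = newton_schulz_orthogonalize(G)
print(np.linalg.svd(G, compute_uv=False))  # widely spread
print(np.linalg.svd(O, compute_uv=False))  # all pulled close to 1
```

The direction with singular value 0.1 ends up with about the same weight in the update as the direction with singular value 10, which is the "amplifies rare directions" effect described above.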

Major frontier AI labs have already adopted Muon for training their largest models. Moonshot AI uses it for Kimi-K2-Thinking and a variant, MuonClip, for Kimi-K2. Zhipu AI confirmed Muon in both GLM-4.5 and the much larger GLM-5 (744B parameters), introducing a "Muon Split" technique that orthogonalizes MLA up-projection matrices per attention head. DeepSeek-V4 (1.6T parameters) also relies on Muon for faster convergence and stability. DeepSpeed's implementation handles hybrid optimization: 2D hidden weight matrices get Muon updates, while all other parameters (embeddings, norms, biases) fall back to AdamW. A fine-tuning demo with Moonlight-16B-A3B (an MoE model with 16B total and 3B active parameters) showed improvements on the MBPP, MMLU, and GSM8K benchmarks.
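The hybrid routing rule can be sketched in a few lines. This is a hypothetical illustration, not DeepSpeed's API: the function name, the parameter names, and the convention that embedding tensors contain "embed" in their name are all assumptions made for the example.

```python
def split_params_for_hybrid_optimizer(param_shapes, embedding_keywords=("embed",)):
    """Route parameters to Muon or AdamW based on shape and name.

    param_shapes: dict mapping parameter name -> shape tuple (hypothetical layout).
    2D hidden weight matrices go to Muon; embeddings, norms, and biases go to AdamW.
    """
    muon_names, adamw_names = [], []
    for name, shape in param_shapes.items():
        is_embedding = any(key in name for key in embedding_keywords)
        if len(shape) == 2 and not is_embedding:
            muon_names.append(name)
        else:
            adamw_names.append(name)
    return muon_names, adamw_names

# Hypothetical parameter names and shapes for a tiny transformer block.
shapes = {
    "embed_tokens.weight": (50000, 768),  # 2D, but an embedding table -> AdamW
    "attn.q_proj.weight": (768, 768),     # 2D hidden weight -> Muon
    "mlp.up_proj.weight": (3072, 768),    # 2D hidden weight -> Muon
    "mlp.up_proj.bias": (3072,),          # 1D bias -> AdamW
    "norm.weight": (768,),                # 1D norm scale -> AdamW
}
muon_names, adamw_names = split_params_for_hybrid_optimizer(shapes)
print(muon_names)
print(adamw_names)
```

Note that an embedding table is stored as a 2D tensor, so a shape check alone is not enough. The sketch uses a name filter for that case, which is one possible design choice among several.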

Key Points

Muon uses one momentum buffer per parameter (vs Adam's two), reducing optimizer state memory by nearly half.
Newton-Schulz orthogonalization of momentum amplifies rare gradient directions, yielding 35% faster training on NanoGPT benchmarks.
Already adopted by Moonshot AI, Zhipu AI (GLM-5 at 744B params), and DeepSeek (V4 at 1.6T params) for production LLM pretraining.
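The "nearly half" figure follows from simple arithmetic, sketched below under stated assumptions: optimizer state is stored in fp32 (4 bytes per value), AdamW keeps two values per parameter, Muon keeps one, and a hypothetical 90% of parameters are 2D hidden weights handled by Muon. The function name and all numbers are illustrative, not measurements.

```python
def optimizer_state_bytes(n_params, muon_fraction, bytes_per_value=4):
    """Return (all-AdamW bytes, hybrid Muon+AdamW bytes) of optimizer state."""
    adamw_only = 2 * n_params * bytes_per_value
    # Muon-handled parameters keep 1 buffer; the rest still keep AdamW's 2.
    values_per_param = muon_fraction * 1 + (1 - muon_fraction) * 2
    hybrid = values_per_param * n_params * bytes_per_value
    return adamw_only, hybrid

adamw_only, hybrid = optimizer_state_bytes(n_params=1_000_000_000, muon_fraction=0.9)
saving = 1 - hybrid / adamw_only
print(f"AdamW: {adamw_only / 1e9:.1f} GB, hybrid: {hybrid / 1e9:.1f} GB, saving: {saving:.0%}")
```

The saving equals half the fraction of parameters routed to Muon, so it approaches 50% only as that fraction approaches 1.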

Why It Matters

Muon Optimizer slashes memory and speeds up LLM training, now battle-tested at frontier scale.
