SignMuon slashes distributed training bandwidth by 32x with 1-bit optimization
New optimizer cuts communication 32x while hitting 92.15% accuracy on ResNet-50...
Distributed training of large neural networks is often bottlenecked by two things: full-precision gradient communication, and coordinatewise optimizers that ignore the matrix structure of the weights. SignMuon addresses both by combining signSGD's 1-bit sign communication and majority-vote aggregation with Muon's polar-step framework. Each worker computes a Muon-style direction by running a Newton-Schulz iteration on its momentum and transmits only the entrywise signs of that direction; the signs are then aggregated by majority vote. An optional local polar step enforces orthogonality without extra communication. Theoretically, the method attains an O(1/√T) nonconvex convergence rate under spectral-norm smoothness, with majority vote scaling the stochastic noise term by 1/√M across M workers.
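The per-iteration procedure described above can be sketched in NumPy as a single-process simulation. This is an illustrative sketch, not the authors' implementation: the function names, the cubic Newton-Schulz coefficients, the momentum form, and the hyperparameter values are all assumptions, and the optional local polar step is left out.

```python
import numpy as np

def newton_schulz_polar(G, steps=10, eps=1e-7):
    """Approximate the polar factor U V^T of G.

    Uses the classical cubic Newton-Schulz iteration (an assumption;
    the source does not state which coefficients SignMuon uses).
    """
    # Frobenius normalization keeps every singular value at or below 1,
    # which is inside the convergence region of the cubic iteration.
    X = G / (np.linalg.norm(G) + eps)
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # work with the wide orientation so X @ X.T is small
        X = X.T
    for _ in range(steps):
        X = 1.5 * X - 0.5 * (X @ X.T) @ X
    return X.T if transposed else X

def worker_signs(momentum):
    """Muon-style direction on one worker, reduced to +1/-1 entries."""
    direction = newton_schulz_polar(momentum)
    return np.where(direction >= 0, 1, -1).astype(np.int8)

def majority_vote(sign_mats):
    """Integer sum of sign matrices followed by sign.

    The sum stands in for the integer sum-allreduce across workers.
    """
    total = np.sum(np.stack(sign_mats), axis=0, dtype=np.int32)
    return np.sign(total).astype(np.int8)

def signmuon_step(W, momenta, grads, lr=0.02, beta=0.9):
    """One simulated step over M workers (hypothetical hyperparameters)."""
    # Assumed momentum form: exponential moving average of gradients.
    new_momenta = [beta * m + (1.0 - beta) * g for m, g in zip(momenta, grads)]
    vote = majority_vote([worker_signs(m) for m in new_momenta])
    return W - lr * vote, new_momenta
```

With an odd number of workers every entry of the vote is +1 or -1. With an even number, a tied entry comes out as 0 in this sketch and that coordinate is simply not updated; how ties are actually broken is not stated in the source.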
Empirically, across 330 CIFAR-10/ResNet-50 configurations, SignMuon achieved the best validation accuracy, 92.15%. A 4-GPU majority-vote variant reached 92.02% accuracy while using 37% less training time at matched effective batch size. On nanoGPT, SignMuon outperformed other sign-based baselines in both perplexity and anytime performance, with favorable weak scaling up to 16 GPUs. The optimizer requires only one integer sum-allreduce per iteration, achieving a 32x bandwidth reduction over float32 (4x over int8). These results position SignMuon as a practical, communication-efficient alternative for scaling distributed deep learning without sacrificing model quality.
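The 32x figure follows from sending 1 bit per entry instead of a 32-bit float. The sketch below illustrates that arithmetic for a single worker's payload by packing signs into bytes; the helper names are hypothetical, and the source does not describe how signs are actually encoded on the wire or inside the allreduce.

```python
import numpy as np

def pack_signs(direction):
    """Encode the entrywise signs of a matrix as 1 bit per entry."""
    bits = (direction >= 0).astype(np.uint8).ravel()  # 1 for +, 0 for -
    return np.packbits(bits)

def unpack_signs(packed, shape):
    """Recover a +1/-1 int8 matrix from the packed bits."""
    n = int(np.prod(shape))
    bits = np.unpackbits(packed)[:n]  # drop any padding bits
    return (bits.astype(np.int8) * 2 - 1).reshape(shape)

rng = np.random.default_rng(0)
direction = rng.standard_normal((64, 128)).astype(np.float32)
packed = pack_signs(direction)
ratio = direction.nbytes / packed.nbytes  # float32 bytes vs packed bytes
```

For this 64 x 128 example the float32 matrix occupies 32,768 bytes and the packed signs occupy 1,024 bytes, a ratio of 32.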
- 32x bandwidth reduction over float32 via 1-bit sign communication and majority-vote aggregation
- Best validation accuracy (92.15%) across 330 CIFAR-10/ResNet-50 configurations
- 4-GPU majority-vote variant achieves 92.02% accuracy with 37% less training time at matched effective batch size
Why It Matters
SignMuon enables faster, cheaper distributed training of large models by cutting communication overhead 32x relative to float32 without sacrificing accuracy.