SignMuon slashes distributed training bandwidth by 32x with 1-bit optimization

arXiv cs.LG May 19, 2026

⚡ New optimizer cuts communication 32x while hitting 92.15% accuracy on ResNet-50...

Deep Dive

Distributed training of large neural networks is often bottlenecked by two things: full-precision gradient communication, and coordinatewise optimizers that ignore the matrix structure of the weights. SignMuon addresses both by combining the 1-bit sign communication and majority-vote aggregation of signSGD with the polar-step framework of Muon. Each worker computes a Muon-style direction by running a Newton-Schulz iteration on its momentum and transmits only the entrywise signs of that direction; the workers' signs are then aggregated by majority vote. An optional local polar step enforces orthogonality without extra communication. On the theory side, the analysis gives an O(1/√T) nonconvex convergence rate under spectral-norm smoothness, with majority vote scaling the stochastic-noise term by 1/√M across M workers.
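
The per-iteration flow described above (Newton-Schulz on the momentum, entrywise signs, majority vote, optional local polar step) can be sketched in NumPy. This is a minimal single-process illustration, not the authors' implementation. All function names are hypothetical, the momentum update itself is left to the caller, and the classic cubic Newton-Schulz iteration is used here for simplicity in place of whatever tuned polynomial the paper or Muon may use. Ties in the vote map to 0, which is also an assumption.

```python
import numpy as np


def newton_schulz_polar(m, steps=30, eps=1e-12):
    """Approximate the polar factor of m with the cubic Newton-Schulz iteration.

    Assumption: plain cubic iteration x <- 1.5 x - 0.5 x x^T x, chosen for clarity.
    """
    x = m / (np.linalg.norm(m) + eps)  # Frobenius scaling keeps singular values <= 1
    transposed = x.shape[0] > x.shape[1]
    if transposed:  # iterate on the wide orientation so x @ x.T is the smaller Gram matrix
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x  # pushes each nonzero singular value toward 1
    return x.T if transposed else x


def worker_sign_message(momentum):
    """What one worker would transmit: the entrywise signs of its polar direction."""
    direction = newton_schulz_polar(momentum)
    return np.where(direction >= 0, 1, -1).astype(np.int8)


def majority_vote(sign_messages):
    """Aggregate +/-1 messages by summing them as integers and taking the sign."""
    total = np.sum(np.stack(sign_messages).astype(np.int32), axis=0)
    return np.sign(total).astype(np.int8)  # assumption: a tie yields 0


def signmuon_style_direction(momenta, local_polar=False):
    """Hypothetical helper: combine per-worker momenta into one update direction."""
    vote = majority_vote([worker_sign_message(m) for m in momenta])
    direction = vote.astype(np.float64)
    if local_polar:  # optional local step, needs no further communication
        direction = newton_schulz_polar(direction)
    return direction
```

In a training loop, a weight matrix `W` would then be updated along the lines of `W -= lr * signmuon_style_direction(momenta)`, where `lr` and the list of per-worker `momenta` are placeholders. In a real multi-worker setup the integer sum inside `majority_vote` is the part that would map onto a sum-allreduce.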

Empirically, across 330 CIFAR-10/ResNet-50 configurations, SignMuon reached the best validation accuracy, 92.15%. A 4-GPU majority-vote variant reached 92.02% accuracy while using 37% less training time at matched effective batch size. On nanoGPT, SignMuon outperformed other sign-based baselines in both perplexity and anytime performance, with favorable weak scaling up to 16 GPUs. The optimizer requires only one integer sum-allreduce per iteration and achieves a 32x bandwidth reduction over float32 (4x over int8). Together, these results position SignMuon as a practical, communication-efficient option for scaling distributed deep learning without sacrificing model quality.
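
The 32x figure over float32 follows from payload arithmetic: 32 bits per entry versus 1 bit per sign. A minimal sketch of that arithmetic is below. It assumes signs are bit-packed with `np.packbits`, which is an illustrative encoding and not something the source specifies, and the helper names are hypothetical.

```python
import numpy as np


def pack_signs(values):
    """Pack entrywise signs into bits: 1 for nonnegative entries, 0 for negative."""
    bits = np.asarray(values).ravel() >= 0
    return np.packbits(bits)  # 8 signs per byte, zero-padded at the end


def unpack_signs(packed, count):
    """Recover a +/-1 int8 vector of length `count` from packed bits."""
    bits = np.unpackbits(packed, count=count)
    return bits.astype(np.int8) * 2 - 1


def payload_bytes(values):
    """Return (float32 payload size, packed sign payload size) in bytes."""
    dense = np.asarray(values, dtype=np.float32)
    return dense.nbytes, pack_signs(dense).nbytes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(1024)
    dense_size, packed_size = payload_bytes(grad)
    print(dense_size, packed_size, dense_size / packed_size)
```

The ratio is exactly 32 only when the number of entries is a multiple of 8, since the last byte is padded otherwise. The sketch counts one worker's outgoing sign payload and does not model how an allreduce represents the intermediate integer sums.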

Key Points

32x bandwidth reduction over float32 via 1-bit sign communication and majority-vote aggregation
Best validation accuracy (92.15%) across 330 CIFAR-10/ResNet-50 configurations
4-GPU variant achieves 92.02% accuracy with 37% less training time at matched effective batch size

Why It Matters

Enables faster, cheaper distributed training of large models by dramatically reducing communication overhead without sacrificing accuracy.
