MXFP8 Training for MoEs: 1.3x training speedup vs BF16 for Llama4 Scout on GB200 cluster using TorchAO and TorchTitan
New MXFP8 training method delivers a 1.3x speedup on a 256-GPU cluster with no loss in model quality.
Researchers from Meta's PyTorch team have demonstrated a significant advance in training efficiency for large language models. Using MXFP8, an 8-bit microscaling floating-point format supported natively by NVIDIA's Blackwell (GB200) GPUs and implemented in their TorchAO quantization library, they achieved a 30.2% end-to-end speedup when training Llama4 Scout, a Mixture-of-Experts (MoE) model, on a 256-GPU GB200 cluster. Crucially, the MXFP8 run showed virtually identical convergence to a baseline trained in standard bfloat16 precision, so the speed boost came with no trade-off in final model quality.
The key component is a new PyTorch operation, `_to_mxfp8_then_scaled_grouped_mm`, which dynamically quantizes expert weights and activations to the efficient MXFP8 format just before computation. This is particularly effective for the grouped matrix multiplications (grouped GEMMs) in the "routed experts" of MoE models, where the operation runs up to 1.8x faster than its bfloat16 equivalent for these shapes. The team integrated the technique into their TorchTitan training framework, making it straightforward to reproduce these results on large-scale clusters.
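To make the quantization step concrete, below is a minimal, illustrative sketch of MXFP8-style block-wise quantization in plain PyTorch. It follows the microscaling idea of one shared power-of-two scale per 32-element block with float8 (e4m3) elements; the helper name `to_mxfp8_blockwise` and the scale/clamp choices are simplifications for illustration, not TorchAO's actual kernel.

```python
import torch

def to_mxfp8_blockwise(x: torch.Tensor, block_size: int = 32):
    # Illustrative MXFP8-style quantization: each block of 32 consecutive values
    # along the last dim shares one power-of-two scale, and the scaled values
    # are stored as float8 e4m3. This mirrors the idea of dynamically quantizing
    # an operand right before the grouped GEMM; it is not the production kernel.
    rows, cols = x.shape
    assert cols % block_size == 0
    blocks = x.reshape(rows, cols // block_size, block_size)
    amax = blocks.abs().amax(dim=-1, keepdim=True).clamp(min=2**-126)
    # Shared exponent chosen so each block's maximum lands near the e4m3 max (448).
    scales = torch.exp2(torch.floor(torch.log2(amax)) - 8)
    q = (blocks / scales).clamp(-448.0, 448.0).to(torch.float8_e4m3fn)
    return q.reshape(rows, cols), scales.squeeze(-1)

x = torch.randn(128, 4096, dtype=torch.bfloat16)
x_fp8, x_scales = to_mxfp8_blockwise(x.float())  # 8-bit elements + per-block scales
```

The actual speedup comes from executing the subsequent grouped multiply on the GB200's FP8 tensor cores, with the block scales applied inside the GEMM rather than by dequantizing the operands first.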
This advancement directly tackles one of the biggest bottlenecks in modern AI: the immense cost and time required to train trillion-parameter MoE models. By squeezing more performance out of existing hardware, MXFP8 training can significantly reduce the financial and environmental costs of developing frontier AI, potentially accelerating the pace of research and deployment for the next generation of models.
- Achieved a 30.2% training speedup for the Llama4 Scout MoE model using the new MXFP8 data format.
- Demonstrated equivalent convergence to a bfloat16 baseline over 3,000+ training steps on a 256-GPU GB200 cluster.
- Enabled via a new TorchAO API (`_to_mxfp8_then_scaled_grouped_mm`) that dynamically quantizes the inputs to the grouped GEMMs in MoE layers (see the reference sketch after this list).
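For readers unfamiliar with grouped GEMMs, the sketch below shows the reference semantics of the routed-expert computation in plain bfloat16 PyTorch: tokens are permuted so each expert's tokens are contiguous, and each slice is multiplied by that expert's weight matrix. All names and shapes here are illustrative; the TorchAO op performs the same computation as a single fused call while quantizing both operands to MXFP8 on the fly.

```python
import torch

def grouped_mm_reference(tokens, expert_weights, offsets):
    # Reference (slow) semantics of the routed-expert grouped GEMM: multiply each
    # expert's contiguous slice of tokens by that expert's weight matrix, then
    # concatenate the results. A fused grouped GEMM does this in one kernel.
    outs, start = [], 0
    for i, end in enumerate(offsets.tolist()):
        outs.append(tokens[start:end] @ expert_weights[i].t())
        start = end
    return torch.cat(outs, dim=0)

# Illustrative shapes: 2048 routed tokens, 4 experts, 1024 -> 2048 projection.
tokens = torch.randn(2048, 1024, dtype=torch.bfloat16)
expert_weights = torch.randn(4, 2048, 1024, dtype=torch.bfloat16)
offsets = torch.tensor([512 * (i + 1) for i in range(4)])  # 512 tokens per expert
out = grouped_mm_reference(tokens, expert_weights, offsets)  # shape (2048, 2048)
```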
Why It Matters
Speeds up training of massive MoE models by roughly 30% (1.3x), cutting the time and cost of frontier model development and making it more accessible.