Enabling Up to 41% Faster Pre-training: MXFP8 and DeepEP for DeepSeek-V3 on B200 with TorchTitan
Combining new 8-bit precision and optimized communication slashes training time for massive 671B-parameter models.
A joint technical effort between PyTorch and Nebius has demonstrated a major leap in training efficiency for massive Mixture-of-Experts (MoE) models. By running the 671B-parameter DeepSeek-V3 model on a 256-GPU NVIDIA B200 cluster using the TorchTitan framework, the team combined two complementary optimizations: MXFP8 compute and DeepEP communication. MXFP8 leverages the B200's native 8-bit tensor cores to accelerate matrix multiplications, while DeepEP replaces inefficient standard all-to-all communication with purpose-built NVLink/RDMA kernels tailored for MoE's dynamic routing. This dual approach directly tackles the twin bottlenecks of compute and inter-GPU data transfer that plague large-scale model training.
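To make the compute side concrete, below is a minimal, illustrative sketch of MX-style (microscaling) block quantization, the numeric format behind MXFP8: every 32 consecutive values share one power-of-two (E8M0-style) scale, and the scaled elements are stored as FP8 E4M3. The function names are hypothetical and this is not TorchAO's implementation, whose kernels fuse the scaling into Blackwell's tensor-core GEMMs; it only shows how the format preserves per-block dynamic range at roughly half the footprint of BF16.

```python
import torch

BLOCK_SIZE = 32    # block size defined by the OCP microscaling (MX) spec
E4M3_MAX = 448.0   # largest finite value representable in torch.float8_e4m3fn

def mxfp8_quantize(x: torch.Tensor):
    """Illustrative MXFP8-style quantization of a 1-D tensor (not TorchAO's kernel)."""
    assert x.numel() % BLOCK_SIZE == 0
    blocks = x.float().reshape(-1, BLOCK_SIZE)
    # One shared power-of-two decode scale per block (E8M0-style), chosen so the
    # block's largest magnitude lands inside the E4M3 representable range.
    amax = blocks.abs().amax(dim=1, keepdim=True).clamp(min=1e-12)
    scale = torch.exp2(torch.ceil(torch.log2(amax / E4M3_MAX)))
    q = (blocks / scale).clamp(-E4M3_MAX, E4M3_MAX).to(torch.float8_e4m3fn)
    return q, scale

def mxfp8_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct an approximation of the original values."""
    return (q.float() * scale).reshape(-1)

x = torch.randn(4096)
q, scale = mxfp8_quantize(x)
print("max abs error:", (x - mxfp8_dequantize(q, scale)).abs().max().item())
```

In the actual training path, block-scaled operands like these are consumed directly by Blackwell's FP8 tensor cores, which is where the matmul speedup comes from.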
The results are substantial. Using DeepEP alone boosted throughput by 32%, from 651 to 859 tokens/sec. Adding MXFP8 acceleration on top pushed the total gain to 41%, achieving 918 tokens/sec. Crucially, loss convergence experiments over 1,500 steps on a smaller 16B MoE model confirmed that MXFP8 training is numerically equivalent to standard BF16 precision, with no degradation in model quality. All experiments were run on Nebius Cloud using open-source, PyTorch-native tooling (TorchAO and DeepEP), ensuring full reproducibility for other researchers and companies.
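Note that both throughput percentages above are relative to the same 651 tokens/sec baseline (BF16 compute with standard all-to-all), not compounded on one another:

```latex
\frac{859}{651} \approx 1.32 \;(\text{DeepEP alone: } +32\%), \qquad
\frac{918}{651} \approx 1.41 \;(\text{DeepEP} + \text{MXFP8: } +41\%)
```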
This work signals a shift from simply chasing 'faster training' to achieving significantly better cost-performance, especially critical for the expensive pre-training of trillion-parameter-scale models. It validates the hardware-software co-design approach, where new GPU capabilities (Blackwell's MXFP8 support) are met with optimized software frameworks (TorchTitan, TorchAO) to unlock real-world speedups. The techniques are immediately applicable to any organization training large MoE architectures, potentially cutting cloud compute costs and development timelines.
- Achieved 41% total throughput gain (918 vs. 651 tokens/sec) training the 671B-parameter DeepSeek-V3 MoE model on 256 NVIDIA B200 GPUs.
- Combined MXFP8 low-precision compute (via TorchAO) and DeepEP's optimized MoE communication, tackling both major training bottlenecks; a sketch of the baseline dispatch DeepEP replaces follows this list.
- Verified that MXFP8 training convergence is equivalent to BF16 on a 16B model, showing the speed gains don't compromise model quality.
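The communication bottleneck is easiest to see against the baseline path DeepEP replaces. Below is a minimal sketch of a generic expert-parallel token dispatch built on torch.distributed.all_to_all_single; names like dispatch_tokens and dest_rank are illustrative, not TorchTitan's or DeepEP's API. DeepEP's dispatch/combine kernels fill this same role with purpose-built NVLink/RDMA transfers, which is where the 32% throughput gain comes from.

```python
import torch
import torch.distributed as dist

def dispatch_tokens(tokens: torch.Tensor, dest_rank: torch.Tensor, group=None):
    """Baseline MoE dispatch: route each token to the rank hosting its expert.

    tokens:    (num_tokens, hidden_dim) activations on the collective's device
    dest_rank: (num_tokens,) int64 tensor with each token's destination rank
    """
    world_size = dist.get_world_size(group)
    # Sort tokens by destination so every rank's slice is contiguous.
    order = torch.argsort(dest_rank)
    tokens, dest_rank = tokens[order], dest_rank[order]
    # How many tokens this rank sends to each peer.
    input_splits = torch.bincount(dest_rank, minlength=world_size)
    # Exchange split sizes so each rank knows how much it will receive.
    output_splits = torch.empty_like(input_splits)
    dist.all_to_all_single(output_splits, input_splits, group=group)
    # The payload exchange itself -- the bandwidth-bound step that DeepEP
    # accelerates with fused NVLink/RDMA kernels.
    recv = tokens.new_empty(int(output_splits.sum()), tokens.shape[-1])
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=output_splits.tolist(),
        input_split_sizes=input_splits.tolist(),
        group=group,
    )
    return recv
```

Run under torchrun after dist.init_process_group(); the reverse "combine" step returns expert outputs to their source ranks with the same collective, and DeepEP replaces both directions.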
Why It Matters
Cuts the time and cost to train frontier AI models, accelerating development cycles for companies building the next generation of large language models.