Scalable Training of Mixture-of-Experts Models with Megatron Core
New framework hits 1,233 TFLOPS/GPU on DeepSeek-V3-685B, enabling efficient training of trillion-parameter models.
A team of 43 NVIDIA researchers has released a comprehensive 88-page technical report introducing a new framework within Megatron Core specifically designed to overcome the systems challenges of training massive Mixture-of-Experts (MoE) models. Unlike dense models, where every parameter is used for every input, MoE models activate only a subset of 'expert' sub-networks per token. This sparsity creates complex, coupled constraints across GPU memory, inter-GPU communication, and computational throughput, making traditional scaling techniques inefficient.
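For illustration (a minimal sketch, not Megatron Core's actual router implementation), top-k routing selects a small number of experts per token, so only those experts' feed-forward blocks run for that token:

```python
import torch

def topk_route(hidden, router_weight, k=2):
    """Toy top-k MoE routing: each token is dispatched to only k experts.

    hidden:        [num_tokens, d_model] token activations
    router_weight: [d_model, num_experts] learned routing matrix
    Returns per-token expert indices and normalized gate weights.
    """
    logits = hidden @ router_weight                   # [num_tokens, num_experts]
    gates, expert_idx = torch.topk(logits, k, dim=-1) # keep only the top-k experts
    gates = torch.softmax(gates, dim=-1)              # renormalize over the chosen experts
    return expert_idx, gates

# Toy usage: 8 tokens, model dim 16, 4 experts, top-2 routing.
tokens = torch.randn(8, 16)
router = torch.randn(16, 4)
idx, w = topk_route(tokens, router)
print(idx.shape, w.shape)  # torch.Size([8, 2]) torch.Size([8, 2])
```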
Megatron Core addresses this through integrated, co-designed optimizations across the entire stack. Key innovations include fine-grained activation recomputation and offloading to reduce memory pressure, optimized dispatchers with communication-computation overlap, and compute optimizations such as Grouped GEMM and CUDA Graphs. The framework also introduces 'Parallel Folding' for flexible multi-dimensional parallelism and supports low-precision training with FP8 and NVFP4 data types for further efficiency gains.
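As a rough sketch of the Grouped GEMM idea (illustrative only, not the framework's actual kernels, and assuming equal token counts per expert for simplicity), the per-expert feed-forward matmuls can be executed as one batched GEMM instead of a Python-level loop of small GEMMs:

```python
import torch

# Hypothetical illustration: launching one small matmul per expert incurs
# repeated kernel-launch and scheduling overhead; grouping them into a single
# batched GEMM keeps the GPU busy with one large launch.
num_experts, tokens_per_expert, d_model, d_ff = 4, 32, 64, 256
expert_inputs = torch.randn(num_experts, tokens_per_expert, d_model)
expert_weights = torch.randn(num_experts, d_model, d_ff)

# Naive: one GEMM per expert.
looped = torch.stack([expert_inputs[e] @ expert_weights[e] for e in range(num_experts)])

# Grouped: a single batched GEMM over all experts (real grouped-GEMM kernels
# also handle unequal token counts per expert without padding).
grouped = torch.bmm(expert_inputs, expert_weights)

print(torch.allclose(looped, grouped, atol=1e-5))  # True
```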
On NVIDIA's latest GB300 and GB200 superchips, the system demonstrated record performance, reaching 1,233 and 1,048 TFLOPS per GPU, respectively, while training the DeepSeek-V3-685B model, and 974 and 919 TFLOPS per GPU on Qwen3-235B. As an open-source, production-ready solution, it has already been used in academia and industry to train MoE models ranging from billions to trillions of parameters on clusters scaling to thousands of GPUs. This work provides the practical systems-level guidance needed to push the frontier of model scale without prohibitive cost.
- Achieved up to 1,233 TFLOPS/GPU training the 685B-parameter DeepSeek-V3 MoE model on NVIDIA GB300/GB200 hardware.
- Solves the 'coupled constraints' problem of MoE sparsity with co-designed optimizations for memory, communication, and computation.
- Open-source framework (Megatron Core) supports training trillion-parameter models on clusters with thousands of GPUs.
Why It Matters
Provides the essential infrastructure for companies to build and train the next generation of trillion-parameter AI models cost-effectively.