veScale-FSDP: Flexible and High-Performance FSDP at Scale
A new FSDP system enables block-wise quantized training and advanced optimizers such as Shampoo, techniques used by frontier models like Gemini.
A research team led by Zezhou Wang, with 11 co-authors, has unveiled veScale-FSDP, a next-generation system designed to overcome critical bottlenecks in large-scale AI model training. The paper addresses the limitations of current Fully Sharded Data Parallel (FSDP, also known as ZeRO) systems, which struggle with modern techniques such as block-wise quantized training and advanced optimizers like Shampoo and Muon, used in state-of-the-art models including Google's Gemini and Moonshot AI's Kimi K2. The core problem is that traditional FSDP's fixed, flat sharding format conflicts with the block-structured computations these methods require, creating inefficiencies that hamper scaling.
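To see the conflict concretely: classic FSDP flattens each weight and splits it into equal contiguous chunks, so a quantization block (and the single scale attached to it) can straddle a shard boundary. Below is a minimal sketch; all sizes and the `owners_of_block` helper are hypothetical, not from the paper.

```python
ROWS, COLS, BLOCK = 256, 256, 128   # toy weight shape and quantization block size
WORLD_SIZE = 4
SHARD = ROWS * COLS // WORLD_SIZE   # classic FSDP: equal flat chunks per rank

def owners_of_block(bi, bj):
    """Ranks holding part of quantization block (bi, bj) of the
    row-major-flattened weight. Shard boundaries here are multiples
    of BLOCK, so checking each block-row's start offset suffices."""
    starts = (r * COLS + bj * BLOCK for r in range(bi * BLOCK, (bi + 1) * BLOCK))
    return sorted({off // SHARD for off in starts})

print(owners_of_block(0, 0))  # -> [0, 1]: one 128x128 block spans two ranks
```

Because block (0, 0) is split between ranks 0 and 1, computing its per-block quantization scale, or a Shampoo-style preconditioner over it, requires cross-rank communication that an element-wise optimizer like Adam never needed.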
The veScale-FSDP solution couples a novel, flexible sharding format called RaggedShard with a structure-aware planning algorithm. This architectural redesign natively supports the data placement that block-wise quantization and non-element-wise optimizers need, which traditional FSDP's fixed format could not accommodate efficiently. The results are substantial: benchmarks show 5% to 66% higher training throughput and 16% to 30% lower memory usage than existing FSDP implementations. Crucially, the system scales efficiently to clusters of tens of thousands of GPUs, removing a major barrier to training the next wave of multi-trillion-parameter models. This work directly enables more efficient and flexible training of the frontier models that are pushing the limits of current hardware.
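The paper's exact RaggedShard layout is not reproduced in this summary, but the underlying idea, shards that follow block boundaries even when that leaves their sizes uneven (hence "ragged"), can be sketched roughly. The `block_shard` function, the round-robin assignment, and all sizes below are hypothetical.

```python
def block_shard(shape, block, world_size):
    """Assign whole blocks of a 2D weight to ranks, round-robin."""
    rows, cols = shape
    blocks = [(bi, bj)
              for bi in range(rows // block)
              for bj in range(cols // block)]
    return {b: i % world_size for i, b in enumerate(blocks)}

plan = block_shard((384, 256), block=128, world_size=4)
print(plan)
# 6 blocks over 4 ranks: ranks 0 and 1 own two blocks each, ranks 2 and 3
# own one -- shard sizes are uneven ("ragged"), but every block and its
# quantization scale stay on a single rank.
```

With every block wholly owned by one rank, per-block scales and preconditioners can be computed locally; a structure-aware planner would then be responsible for balancing these uneven shards across ranks.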
- Introduces the 'RaggedShard' format and a structure-aware planning algorithm for flexible sharding, restoring compatibility with block-wise quantized training.
- Delivers 5-66% higher throughput and 16-30% lower memory usage versus current FSDP systems in benchmarks.
- Enables efficient scaling to tens of thousands of GPUs and supports advanced optimizers such as Shampoo and Muon, as used in models like Gemini and Kimi K2.
Why It Matters
Removes a key scaling bottleneck for training trillion-parameter models, making advanced techniques like block-wise quantization feasible at scale.