veScale-FSDP: Flexible and High-Performance FSDP at Scale
A new FSDP system enables block-wise quantized training and advanced optimizers such as Shampoo, techniques used by frontier models like Gemini.
A research team led by Zezhou Wang, with 11 co-authors, has unveiled veScale-FSDP, a next-generation system designed to overcome critical bottlenecks in large-scale AI model training. The paper addresses the limitations of current Fully Sharded Data Parallel (FSDP, also known as ZeRO) systems, which struggle with modern techniques such as block-wise quantized training and advanced optimizers like Shampoo and Muon, used in state-of-the-art models including Google's Gemini and Moonshot AI's Kimi K2. The core problem is that traditional FSDP's fixed, flat sharding format conflicts with the block-structured computations these methods require, creating inefficiencies that hamper scaling.
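To see the conflict concretely: classic FSDP flattens each weight and splits it into equal contiguous chunks, so a quantization block (and the single scale attached to it) can straddle a shard boundary. Below is a minimal sketch; all sizes and the `owners_of_block` helper are hypothetical, not from the paper.

```python
ROWS, COLS, BLOCK = 256, 256, 128   # toy weight shape and quantization block size
WORLD_SIZE = 4
SHARD = ROWS * COLS // WORLD_SIZE   # classic FSDP: equal flat chunks per rank

def owners_of_block(bi, bj):
    """Ranks holding part of quantization block (bi, bj) of the
    row-major-flattened weight. Shard boundaries here are multiples
    of BLOCK, so checking each block-row's start offset suffices."""
    starts = (r * COLS + bj * BLOCK for r in range(bi * BLOCK, (bi + 1) * BLOCK))
    return sorted({off // SHARD for off in starts})

print(owners_of_block(0, 0))  # -> [0, 1]: one 128x128 block spans two ranks
```

Because block (0, 0) is split between ranks 0 and 1, computing its per-block quantization scale, or a Shampoo-style preconditioner over it, requires cross-rank communication that an element-wise optimizer like Adam never needed.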
The veScale-FSDP solution couples a novel, flexible sharding format called RaggedShard with a structure-aware planning algorithm. This architectural redesign natively supports the data placement that block-wise quantization and non-element-wise optimizers need, which traditional FSDP's fixed format could not accommodate efficiently. The results are substantial: benchmarks show 5% to 66% higher training throughput and 16% to 30% lower memory usage than existing FSDP implementations. Crucially, the system scales efficiently to clusters of tens of thousands of GPUs, removing a major barrier to training the next wave of multi-trillion-parameter models. This work directly enables more efficient and flexible training of the frontier models that are pushing the limits of current hardware.
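The paper's exact RaggedShard layout is not reproduced in this summary, but the underlying idea, shards that follow block boundaries even when that leaves their sizes uneven (hence "ragged"), can be sketched roughly. The `block_shard` function, the round-robin assignment, and all sizes below are hypothetical.

```python
def block_shard(shape, block, world_size):
    """Assign whole blocks of a 2D weight to ranks, round-robin."""
    rows, cols = shape
    blocks = [(bi, bj)
              for bi in range(rows // block)
              for bj in range(cols // block)]
    return {b: i % world_size for i, b in enumerate(blocks)}

plan = block_shard((384, 256), block=128, world_size=4)
print(plan)
# 6 blocks over 4 ranks: ranks 0 and 1 own two blocks each, ranks 2 and 3
# own one -- shard sizes are uneven ("ragged"), but every block and its
# quantization scale stay on a single rank.
```

With every block wholly owned by one rank, per-block scales and preconditioners can be computed locally; a structure-aware planner would then be responsible for balancing these uneven shards across ranks.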
- Introduces the 'RaggedShard' format and a structure-aware planning algorithm for flexible sharding, restoring compatibility with block-wise quantized training.
- Delivers 5-66% higher throughput and 16-30% lower memory usage versus current FSDP systems in benchmarks.
- Enables efficient scaling to tens of thousands of GPUs and supports advanced optimizers such as Shampoo and Muon, as used in models like Gemini and Kimi K2.
Why It Matters
Removes a key scaling bottleneck for training trillion-parameter models, making advanced techniques like block-wise quantization feasible at scale.