trunk/24a1530f3aeb5a81f9def5198004a3452953221c: [FSDP2] support per-param mesh (#173509)
This could dramatically speed up training for massive MoE models...
Deep Dive
A new PyTorch commit (PR #173509) adds per-parameter device mesh support to FSDP2, letting developers assign different device meshes to expert and non-expert parameters within the same transformer block. This enables more efficient scheduling of all-gather operations, which can reduce memory usage and speed up training for Mixture-of-Experts architectures. The change is backward compatible, so existing FSDP2 code continues to work unchanged. Experiments are referenced in the TorchTitan repository.
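For context, here is a minimal sketch of how expert and non-expert parameters in an MoE block might be placed on different device meshes with FSDP2's fully_shard API. The MoEBlock module, the mesh shapes, and the nested-call arrangement are illustrative assumptions; the exact per-parameter mesh API introduced by PR #173509 may differ from the module-level nesting shown here.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard


# A hypothetical MoE transformer block for illustration.
class MoEBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_experts: int = 8):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads=8)
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))


# Assumed 16-GPU job (run under torchrun with distributed initialized):
# experts use a 2x8 hybrid mesh, everything else a flat 16-way mesh.
expert_mesh = init_device_mesh("cuda", (2, 8), mesh_dim_names=("replicate", "shard"))
dense_mesh = init_device_mesh("cuda", (16,), mesh_dim_names=("shard",))

block = MoEBlock()
# Shard each expert on its own mesh first; the outer fully_shard call
# then manages the remaining (attention, router) parameters on the
# dense mesh, since inner fully_shard calls take precedence.
for expert in block.experts:
    fully_shard(expert, mesh=expert_mesh)
fully_shard(block, mesh=dense_mesh)
```

Keeping expert parameters on a separate mesh lets their all-gathers be scheduled independently of the dense parameters', which is the scheduling flexibility the commit targets.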
Why It Matters
This paves the way for faster, more efficient training of next-generation trillion-parameter AI models.