trunk/761237cd9e236246bf701d484481251ae1e2ab0b: Enable Copy Engine all-gather in FSDP (#176613)
New symmetric memory allocation technique speeds up distributed training by overlapping communication with computation.
The PyTorch team has merged a significant optimization into its Fully Sharded Data Parallel (FSDP) distributed training framework. The change, implemented in pull request #176613, enables NVIDIA's Copy Engine for all-gather operations by introducing symmetric memory allocation for communication buffers. Because the Copy Engine moves data without occupying the GPU's SMs, all-gather operations can overlap with GEMM (General Matrix Multiply) computations, yielding a measured 15% end-to-end speedup in microbenchmarks. The implementation adds a `SymmMemAllocMixin` to FSDP that allocates symmetric memory for the all-gather buffers, and it includes memory pooling so buffers can be reused without repeated rendezvous calls.
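To make the pooling idea concrete, below is a minimal sketch of a buffer pool that allocates all-gather staging buffers from PyTorch's symmetric memory allocator and caches them by size, so the rendezvous handshake happens only once per buffer shape. It assumes the `torch.distributed._symmetric_memory` helpers `empty` and `rendezvous` (whose exact signatures vary across PyTorch versions); the PR's actual `SymmMemAllocMixin` may be structured quite differently, and the pool class and names here are illustrative only.

```python
# Illustrative sketch, not the PR's SymmMemAllocMixin: a tiny pool that hands
# out symmetric-memory buffers and reuses them, so the rendezvous handshake
# runs once per (numel, dtype) instead of on every all-gather.
import torch
import torch.distributed._symmetric_memory as symm_mem  # helper names assumed


class SymmMemBufferPool:
    def __init__(self, group_name: str, device: torch.device):
        self.group_name = group_name  # process-group name used for rendezvous
        self.device = device
        self._cache: dict[tuple[int, torch.dtype], torch.Tensor] = {}

    def get(self, numel: int, dtype: torch.dtype) -> torch.Tensor:
        key = (numel, dtype)
        buf = self._cache.get(key)
        if buf is None:
            # Allocate from the symmetric-memory allocator and perform the
            # one-time rendezvous that exchanges peer handles across ranks;
            # cached buffers skip this step on subsequent all-gathers.
            buf = symm_mem.empty(numel, dtype=dtype, device=self.device)
            symm_mem.rendezvous(buf, self.group_name)
            self._cache[key] = buf
        return buf
```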
Users can activate the optimization through the new `set_symm_mem_for_comm` API, which triggers the symmetric memory allocation and enables the Copy Engine pathway. The team has verified the functionality through the `TestFullyShardSymmMem` test case, and profiling confirms that all-gather operations are now handled by the Copy Engine. Looking ahead, the developers plan to extend symmetric memory allocation to reduce-scatter operations; that path won't trigger the Copy Engine, but it will enable newer, faster NCCL 2.29 kernels for improved scalability. This is a meaningful step in optimizing distributed training efficiency for large language models and other compute-intensive AI workloads.
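As a rough illustration of how a user might opt in, the sketch below shards a toy model with FSDP2's `fully_shard` and then calls `set_symm_mem_for_comm`. Only that API name comes from the PR; its import path, signature, and whether it is a method on the sharded module or a standalone function are assumptions here, so treat this as a hypothetical call site rather than the actual interface.

```python
# Hypothetical opt-in sketch: only the name set_symm_mem_for_comm comes from
# the PR description; its exact location and signature are assumed.
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import fully_shard  # FSDP2 per-module API

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Toy stand-in for a transformer stack.
model = nn.Sequential(*(nn.Linear(4096, 4096) for _ in range(4))).cuda()
for layer in model:
    fully_shard(layer)
fully_shard(model)

# Assumed call site: request symmetric-memory communication buffers so the
# all-gather can be driven by the Copy Engine and overlap with GEMMs.
model.set_symm_mem_for_comm(True)
```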
- Delivers a measured 15% end-to-end speedup in microbenchmarks by overlapping all-gather with GEMM operations
- Introduces symmetric memory allocation via new `SymmMemAllocMixin` and `set_symm_mem_for_comm` API
- Future optimization planned for reduce-scatter using NCCL 2.29's faster symmetric kernels
Why It Matters
Reduces training time and costs for large AI models by optimizing communication bottlenecks in distributed setups.