PyTorch PR adds test coverage for CUDA symmetric memory without ProcessGroupNCCL
Torchcomms shim adds test coverage for rendezvous bypassing ProcessGroupNCCL...
PyTorch merged PR #184523, which adds test coverage for CUDA symmetric memory (symm_mem) when backed by torchcomms through the new _BackendWrapper shim. The PR introduces a test class `TorchCommsCudaSymmMemTest` that validates two rendezvous strategies: one where metadata allgather flows through the torchcomms-backed ProcessGroup, and another where metadata exchange falls back to the default TCPStore. In both cases, the test allocates a symmetric memory buffer, rendezvouses on the ProcessGroup, and verifies each rank can read its peers' buffers.
This test serves as both a regression guard and an example of how to use the _BackendWrapper with CUDA symmetric memory. By confirming that symm_mem rendezvous works without ProcessGroupNCCL, the PR supports more flexible distributed training setups, especially in heterogeneous environments or when using custom communication backends. The PR was approved by ngimel and fduwjj, key maintainers of PyTorch's distributed module.
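The two rendezvous strategies differ only in how each rank's buffer metadata reaches its peers. The following is a toy, single-process model of that difference, not the PyTorch or torchcomms API: `ToyHandle`, `ToyStore`, and every function name here are hypothetical stand-ins, and ordinary Python lists play the role of GPU buffers. It assumes each rank fills its buffer with its own rank number so that a peer read can be checked.

```python
from dataclasses import dataclass


@dataclass
class ToyHandle:
    """Hypothetical stand-in for the metadata one rank publishes about its buffer."""
    rank: int
    buffer: list


class ToyStore:
    """Hypothetical in-memory key-value store, standing in for a TCPStore."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


def make_handles(world_size, numel):
    # Each simulated rank allocates a buffer filled with its own rank number.
    return [ToyHandle(rank=r, buffer=[r] * numel) for r in range(world_size)]


def exchange_via_allgather(handles):
    # Strategy 1: metadata travels through a collective on the process group,
    # so every rank ends up with the same rank-ordered list of handles.
    gathered = sorted(handles, key=lambda h: h.rank)
    return {h.rank: list(gathered) for h in handles}


def exchange_via_store(handles, store):
    # Strategy 2: each rank publishes its handle under a per-rank key,
    # then looks up every peer's key in the shared store.
    for h in handles:
        store.set(f"symm_mem/{h.rank}", h)
    world_size = len(handles)
    return {
        h.rank: [store.get(f"symm_mem/{r}") for r in range(world_size)]
        for h in handles
    }


def peers_readable(views):
    # Every rank reads every peer's buffer and checks it holds the peer's rank.
    return all(
        value == peer.rank
        for peers in views.values()
        for peer in peers
        for value in peer.buffer
    )


if __name__ == "__main__":
    handles = make_handles(world_size=4, numel=8)
    print(peers_readable(exchange_via_allgather(handles)))
    print(peers_readable(exchange_via_store(handles, ToyStore())))
```

In this model both strategies give every rank the same view of its peers' buffers, which is the property the two test variants check for the real rendezvous paths.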
- PR #184523 adds tests for CUDASymmetricMemory rendezvous with a torchcomms-backed ProcessGroup
- Two variants: metadata allgather via PG or fallback to TCPStore
- Validates the _BackendWrapper shim path, showing that symm_mem works without ProcessGroupNCCL
Why It Matters
The tests confirm that symmetric memory works with custom backends through torchcomms, which reduces the dependency on NCCL and allows more flexible distributed training setups.