trunk/93dd7743c6577271a81f2fef0fdeafc5fe06e553: [SymmMem] put_signal and wait_signal (#174034)
This PyTorch commit adds low-level signaling primitives that could cut inter-GPU synchronization overhead in multi-GPU training.
A new commit to PyTorch's main branch introduces two backend-agnostic operations, `put_signal` and `wait_signal`, designed for one-sided communication between GPUs. These ops let one GPU write data directly into another's symmetric memory and then raise a completion signal, so the receiver only waits on a lightweight flag instead of participating in a two-sided handshake or a full collective. For now, only an NCCL-based implementation is available, with support for other backends planned. This is core infrastructure aimed at optimizing distributed training.
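To make the shape of the API concrete, here is a minimal two-rank sketch of the put/signal pattern using PyTorch's symmetric-memory Python bindings. The `get_buffer` helper and the exact `put_signal`/`wait_signal` signatures are assumptions inferred from the commit title, not confirmed against the merged code; consult the PR for the actual surface.

```python
# Minimal two-rank sketch (launch with `torchrun --nproc-per-node 2 demo.py`).
# The handle methods below mirror the op names in the commit title; exact
# signatures are assumptions and may differ from the merged API.
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

# Allocate a symmetric-memory buffer and exchange handles across ranks.
buf = symm_mem.empty(1024, dtype=torch.float32, device=f"cuda:{rank}")
handle = symm_mem.rendezvous(buf, dist.group.WORLD.group_name)

if rank == 0:
    # One-sided put: write straight into rank 1's copy of the buffer,
    # then raise a completion signal on rank 1. No receive is posted.
    peer_buf = handle.get_buffer(1, buf.shape, buf.dtype)  # assumed helper
    peer_buf.fill_(42.0)
    handle.put_signal(1)   # signal rank 1 that the data is ready (assumed)
else:
    # Block only on the lightweight signal, not a full collective; once it
    # arrives, the data rank 0 wrote into our local buffer is safe to read.
    handle.wait_signal(0)  # wait for rank 0's signal (assumed)
    assert torch.all(buf == 42.0)

dist.destroy_process_group()
```

The point of this pattern over a regular send/recv pair is that the receiving rank never participates in the transfer itself; it only waits on a signal flag, which is what makes the operation one-sided.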
Why It Matters
Lower-overhead inter-GPU communication can trim synchronization time from every training step, and those savings compound across the many steps and many GPUs involved in training large language models and other large-scale AI systems.