PyTorch adds new 'put_signal' and 'wait_signal' ops for faster AI training
This new PyTorch commit adds low-level primitives that could cut communication overhead in multi-GPU training for AI models.
A new commit to PyTorch's main branch introduces two backend-agnostic operations, `put_signal` and `wait_signal`, designed for one-sided communication between GPUs. These ops allow one GPU to write data directly into another GPU's symmetric memory and signal when the write is complete, rather than relying on two-sided coordination in which sender and receiver must both take part in each transfer. Currently, only an NCCL-based implementation is available, with support for other backends planned for the future. This is a core infrastructure change aimed at optimizing distributed training.
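To make the put-then-signal pattern concrete, here is a minimal conceptual sketch in plain Python, with threads standing in for GPUs. It does not use PyTorch or NCCL, and it is not the signature of the new ops: the names `ToySymmetricBuffer`, `toy_put_signal`, `toy_wait_signal`, and `demo` are hypothetical and exist only for illustration.

```python
import threading


class ToySymmetricBuffer:
    """Hypothetical stand-in for one peer's symmetric memory:
    a data region plus a signal flag the writer can raise."""

    def __init__(self, size):
        self.data = [0] * size
        self.signal = threading.Event()


def toy_put_signal(dst, payload):
    """Writer side: copy the payload into the peer's buffer, then raise the signal.
    The reader takes no part in the copy itself (one-sided)."""
    if len(payload) > len(dst.data):
        raise ValueError("payload does not fit in destination buffer")
    dst.data[: len(payload)] = payload
    dst.signal.set()  # signal only after the write has finished


def toy_wait_signal(buf, timeout=5.0):
    """Reader side: block until the signal is raised, then read the data."""
    if not buf.signal.wait(timeout):
        raise TimeoutError("signal was not raised in time")
    buf.signal.clear()  # reset so the buffer can be reused
    return list(buf.data)


def demo():
    peer_buf = ToySymmetricBuffer(4)
    received = []

    # The "receiving GPU" only waits on the signal; it never posts a receive.
    reader = threading.Thread(target=lambda: received.append(toy_wait_signal(peer_buf)))
    reader.start()

    # The "sending GPU" writes straight into the peer's buffer and signals.
    toy_put_signal(peer_buf, [1, 2, 3, 4])
    reader.join()
    return received[0]


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is the ordering: the signal is raised only after the data is in place, so the waiting side can safely read as soon as it wakes up.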
Why It Matters
Faster inter-GPU communication can shorten training times for large language models and other complex AI systems, though the size of any gain will depend on the workload and hardware.