trunk/8b58a46fba3d4a91eb6487bd56811cf17c22c760: Add CUDA-aware detection for Cray MPICH (#178323)
A single commit fixes a major performance bottleneck for AI training on HPE/Cray systems.
The PyTorch open-source team has merged a critical performance fix (commit 8b58a46) that resolves a long-standing bottleneck for AI researchers using HPE/Cray supercomputers. The issue stemmed from PyTorch's `cudaAwareMpiCheck()` function, which only looked for Open MPI's `MPIX_CUDA_AWARE_SUPPORT` flag. This meant systems running Cray MPICH—the standard on major supercomputers like the ALCF Polaris with NVIDIA A100 GPUs—were incorrectly flagged as non-CUDA-aware. Consequently, PyTorch would fall back to slow CPU-based data transfers for all MPI operations, crippling distributed training performance even when the underlying hardware supported direct GPU communication.
The fix adds a new `#elif` preprocessor branch that checks for Cray MPICH's specific indicators: the `MPIX_GPU_SUPPORT_CUDA` compile-time define and the `MPICH_GPU_SUPPORT_ENABLED` environment variable at runtime. This pattern mirrors the existing Open MPI detection logic. Testing on the Polaris system confirmed the patch works: the check now correctly returns `true`, unlocking GPU-direct MPI operations. The change is backward-compatible and doesn't affect existing Open MPI installations, making it a safe drop-in performance upgrade for affected HPC environments.
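In PyTorch's MPI backend this check boils down to a small preprocessor ladder. The sketch below illustrates that pattern as described above; it is a hedged reconstruction, not the verbatim diff, and the exact form of the environment-variable comparison is an assumption.

```cpp
#include <cstdlib>
#include <cstring>
#include <mpi.h>
#if defined(OPEN_MPI) && OPEN_MPI
#include <mpi-ext.h> // Open MPI extension header that provides MPIX_CUDA_AWARE_SUPPORT
#endif

// Sketch of the detection pattern described above: a compile-time branch per
// MPI implementation, plus a runtime confirmation where the library needs one.
bool cudaAwareMpiCheck() {
#if defined(MPIX_CUDA_AWARE_SUPPORT)
  // Open MPI: ask the library at runtime whether CUDA-aware support is active.
  return MPIX_Query_cuda_support() == 1;
#elif defined(MPIX_GPU_SUPPORT_CUDA)
  // Cray MPICH: the build advertises GPU support, but it is only active when
  // MPICH_GPU_SUPPORT_ENABLED=1 is exported in the job environment.
  const char* env = std::getenv("MPICH_GPU_SUPPORT_ENABLED");
  return env != nullptr && std::strcmp(env, "1") == 0;
#else
  // No known indicator: conservatively treat the MPI library as not CUDA-aware.
  return false;
#endif
}
```

In practice, the runtime half of the check on a Cray system such as Polaris amounts to exporting `MPICH_GPU_SUPPORT_ENABLED=1` in the job script, which Cray MPICH already requires before it will perform GPU-aware transfers.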
For AI teams training massive models, this single commit eliminates a major artificial bottleneck. By enabling direct GPU-to-GPU transfers via MPI, it removes the need to copy data from GPU memory to host memory and back for inter-node communication. This can dramatically reduce latency and increase effective bandwidth, potentially speeding up distributed training jobs by an order of magnitude on supported Cray systems. The patch represents a crucial alignment between mainstream AI frameworks and the specialized software stacks of world-class supercomputing facilities.
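To make the difference concrete, here is a simplified, hypothetical illustration (not PyTorch source) of the two communication paths the check selects between for a single allreduce; the helper names are invented for this example.

```cpp
#include <mpi.h>
#include <cuda_runtime.h>
#include <vector>

// Fallback path when MPI is treated as non-CUDA-aware: stage the data through
// host memory around the collective (device -> host, reduce, host -> device).
void allreduce_staged(float* d_buf, int n, MPI_Comm comm) {
  std::vector<float> h_buf(n);
  cudaMemcpy(h_buf.data(), d_buf, n * sizeof(float), cudaMemcpyDeviceToHost);
  MPI_Allreduce(MPI_IN_PLACE, h_buf.data(), n, MPI_FLOAT, MPI_SUM, comm);
  cudaMemcpy(d_buf, h_buf.data(), n * sizeof(float), cudaMemcpyHostToDevice);
}

// CUDA-aware path: the MPI library accepts device pointers directly, so the
// two staging copies disappear and the transfer can go GPU-to-GPU.
void allreduce_gpu_direct(float* d_buf, int n, MPI_Comm comm) {
  MPI_Allreduce(MPI_IN_PLACE, d_buf, n, MPI_FLOAT, MPI_SUM, comm);
}
```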
- Fixes PyTorch's failure to detect CUDA-aware MPI when built against Cray MPICH, the standard MPI library on HPE/Cray supercomputers.
- Enables GPU-direct MPI communication, eliminating CPU-copy bottlenecks for distributed training on systems like ALCF Polaris.
- Backward-compatible change that only activates when Cray-specific flags (`MPIX_GPU_SUPPORT_CUDA`) are present, leaving Open MPI unchanged.
Why It Matters
Can unlock order-of-magnitude faster distributed AI training on world-class supercomputers by enabling direct GPU-to-GPU communication.