Developer Tools

trunk/e2584b2554d11fda4998a8d2be6145b0eded5049: [ROCm] Enable rocSHMEM (#173518)

A major PyTorch commit enables symmetric memory operations on AMD hardware, fixing 20+ issues.

Deep Dive

The PyTorch development team has merged a foundational pull request (PR #173518) that ports NVSHMEM-based symmetric memory collective operations to AMD's ROCm platform by integrating rocSHMEM. The change enables PyTorch's existing `torch.ops.symm_mem.*` application programming interfaces (APIs) to run natively on AMD GPUs, closing a capability gap that had confined them to NVIDIA hardware. The ported operations are critical for high-performance distributed training and include point-to-point functions (`put`, `get`), collective operations (`broadcast`, `all_to_all`), and specialized 2D AllToAllv variants designed for efficient Mixture-of-Experts (MoE) model architectures.
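For context, a minimal sketch of how these APIs are typically exercised is shown below. It assumes a recent PyTorch build with symmetric-memory support, launched via `torchrun` with one process per GPU; the allocation and rendezvous calls come from `torch.distributed._symmetric_memory`, while the commented-out collective at the end uses an op name inferred from the NVSHMEM-backed operations above, not a verified signature.

```python
# Minimal sketch of the symm_mem workflow on a multi-GPU node. Assumes a
# recent PyTorch build with symmetric-memory support and `torchrun` with
# one process per GPU; the op on the commented-out line is an assumption
# based on the NVSHMEM-backed ops named above, not a verified signature.
import torch
import torch.distributed as dist
import torch.distributed._symmetric_memory as symm_mem

dist.init_process_group("nccl")  # resolves to RCCL on ROCm builds
rank = dist.get_rank()
device = torch.device("cuda", rank % torch.cuda.device_count())
torch.cuda.set_device(device)

# Allocate from the symmetric heap, then rendezvous so every rank can
# address the peer copies of this buffer.
buf = symm_mem.empty(1024, dtype=torch.float32, device=device)
symm_mem.rendezvous(buf, group=dist.group.WORLD)

buf.fill_(float(rank))
# Hypothetical invocation of one of the ported collectives (assumed API):
# torch.ops.symm_mem.nvshmem_broadcast(buf, dist.group.WORLD.group_name)

dist.destroy_process_group()
```

On a ROCm build, the same Python surface should dispatch to the rocSHMEM-backed kernels this commit adds, which is the point of keeping the `torch.ops.symm_mem.*` namespace unchanged.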

Key engineering decisions were driven by architectural differences between NVIDIA's and AMD's platforms. Rather than cluttering the existing NVSHMEM code with conditional compilation (`#ifdef`) blocks, the team created a separate `rocshmem_extension.cu` file; this isolation was necessary because rocSHMEM requires relocatable device code compilation (`-fgpu-rdc`), while the rest of PyTorch's HIP code does not. Furthermore, rocSHMEM lacks a grid-wide synchronization primitive comparable to NVSHMEM's `nvshmemx_collective_launch`, so a separate, single-block kernel safely writes output offsets after the data exchange, preventing race conditions (see the sketch below). A hipification layer maps host-side API calls to their rocSHMEM counterparts, allowing higher-level management code to remain largely backend-agnostic.
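The offsets step that motivates the extra kernel is easy to see in miniature: after an AllToAllv exchange, each rank must turn the per-peer split sizes it received into output offsets, i.e. an exclusive prefix sum. The sketch below performs the equivalent computation host-side in plain PyTorch with illustrative values; in the commit itself this runs on-device in the separate single-block kernel, since rocSHMEM provides no grid-wide barrier that would make a fused version safe.

```python
import torch

# Split sizes received from each of 4 peers after the exchange (illustrative).
recv_splits = torch.tensor([3, 0, 5, 2])

# Exclusive prefix sum: where each peer's rows start in the output buffer.
out_offsets = torch.zeros_like(recv_splits)
out_offsets[1:] = torch.cumsum(recv_splits, dim=0)[:-1]

print(out_offsets.tolist())  # [0, 3, 3, 8]
```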

This integration resolves more than 20 GitHub issues related to ROCm support for these memory operations, a substantial milestone in maturing the AI software stack on AMD hardware. It directly benefits developers and researchers running large-scale, distributed PyTorch workloads on AMD Instinct GPUs, giving them performant, standardized APIs for symmetric memory access that were previously exclusive to NVIDIA systems.

Key Points
  • Enables PyTorch's `torch.ops.symm_mem.*` APIs on AMD GPUs via rocSHMEM integration, porting key collective operations.
  • Uses a separate compilation unit to handle rocSHMEM's requirement for relocatable device code (`-fgpu-rdc`), isolating that flag from the rest of PyTorch's HIP build.
  • Fixes over 20 specific GitHub issues, significantly improving ROCm's compatibility for distributed training and MoE model workloads.

Why It Matters

This reduces vendor lock-in for AI training, giving teams a viable, performant PyTorch path on AMD hardware for large models.