Developer Tools

Vectorized scatter_add with TMA bulk reduce on sm_90+ (#182675)

Avoids atomicAdd stalls by offloading accumulation to the TMA unit, yielding up to a 4.6x speedup.

Deep Dive

The PyTorch team has merged a major optimization for scatter_add that leverages the TMA (Tensor Memory Accelerator) unit on NVIDIA Hopper (sm_90) and Blackwell architectures. The new fast path uses `cp.reduce.async.bulk.add` to offload the entire reduction to the TMA unit, avoiding the performance-killing serial atomicAdd stalls that plagued both the existing scatter_add and index_add implementations under high contention. The kernel uses warp-per-slice scheduling, double-buffered shared memory, and mbarrier-based completion tracking to maximize throughput. For pre-sm_90 GPUs, the implementation falls back to a vectorized kernel using `ld_vec<16>` plus per-element atomicAdd with warp-per-slice packing.
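
The core move can be illustrated with a simplified kernel. The sketch below is not the PyTorch implementation: the names (`scatter_add_tma`, `tma_reduce_add_f32`, `stage`) are invented, it assumes a contiguous f32 source of shape `[num_slices, dim_size]` with `dim_size * sizeof(float)` a multiple of 16, it schedules a block per slice rather than a warp per slice, and it completes via bulk-group commit/wait instead of the double-buffered mbarrier pipeline described above. What it does show is the essential idea: stage a slice in shared memory, then let the TMA unit perform the additive reduction into the destination row.

```cuda
// Minimal sketch of the TMA bulk-reduce path (sm_90+), not the PyTorch kernel.
// Assumes contiguous f32 tensors, dim_size * sizeof(float) a multiple of 16,
// and 16-byte alignment of the staging buffer and every output row.
#include <cuda_runtime.h>

__device__ void tma_reduce_add_f32(float* dst_gmem, const float* src_smem,
                                   unsigned nbytes) {
#if __CUDA_ARCH__ >= 900
  unsigned smem = static_cast<unsigned>(__cvta_generic_to_shared(src_smem));
  // Hand the additive reduction (dst += src) to the TMA unit.
  asm volatile(
      "cp.reduce.async.bulk.global.shared::cta.bulk_group.add.f32 [%0], [%1], %2;"
      :: "l"(dst_gmem), "r"(smem), "r"(nbytes) : "memory");
  // Completion via bulk-group commit/wait (the real kernel pipelines with
  // mbarriers and double buffering instead of waiting inline).
  asm volatile("cp.async.bulk.commit_group;");
  asm volatile("cp.async.bulk.wait_group 0;" ::: "memory");
#endif
}

__global__ void scatter_add_tma(float* out, const float* src,
                                const long* index, int dim_size) {
  extern __shared__ __align__(16) float stage[];  // staging buffer for one slice
  const long slice = blockIdx.x;                  // block-per-slice for simplicity
  for (int i = threadIdx.x; i < dim_size; i += blockDim.x)
    stage[i] = src[slice * dim_size + i];
  __syncthreads();
  if (threadIdx.x == 0) {  // bulk async ops are issued by a single thread
    float* dst = out + index[slice] * dim_size;
    tma_reduce_add_f32(dst, stage, (unsigned)(dim_size * sizeof(float)));
  }
}
```

A launch along the lines of `scatter_add_tma<<<num_slices, 128, dim_size * sizeof(float)>>>(out, src, index, dim_size)` would drive this sketch. The key property is that the TMA unit performs the read-modify-write itself, so the hot path issues no per-element atomics at all.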

Benchmarks on NVIDIA GB200 hardware are striking: under high contention (7.77 million indices, 3.89 million of them targeting a single row), scatter_add achieves a 4.6x speedup over the baseline index_add for 128-dim f32 tensors (18.4 ms vs 84.0 ms), and 4.7x for bf16. Even under a uniform random index distribution (500K indices, 100K rows), the speedup ranges from 1.5x to 3.4x depending on dtype and dimension size. The implementation supports float32, float64, float16, and bfloat16. It also fixes a subtle operator-precedence bug in the existing `fast_gather_kernel_eligible` eligibility check. AMD ROCm GPUs (e.g., gfx942/gfx950) are excluded from the fast path to preserve their existing optimized warp-level atomic coalescing.
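
The PR does not reproduce the buggy expression, but the class of bug is worth a hypothetical illustration: `&&` binds tighter than `||`, so a missing pair of parentheses silently changes which inputs pass an eligibility check. The condition names below are made up for the example.

```cuda
// Hypothetical illustration of the bug class, not the actual PyTorch check.
// '&&' binds tighter than '||', so the two forms below admit different inputs.
bool eligible_buggy(bool contiguous, bool aligned, bool small_index) {
  // Parsed as (contiguous && aligned) || small_index.
  return contiguous && aligned || small_index;
}
bool eligible_fixed(bool contiguous, bool aligned, bool small_index) {
  return contiguous && (aligned || small_index);  // the intended grouping
}
```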

Key Points
  • New fast path uses TMA bulk reduce (`cp.reduce.async.bulk`) to offload accumulation, avoiding serial atomicAdd stalls on sm_90+ (Hopper/Blackwell).
  • Benchmarks show a 4.6x speedup for high-contention workloads (7.77M indices, 3.89M targeting one row) with 128-dim f32 tensors on GB200.
  • Includes a vectorized scatter_add fallback for pre-sm_90 NVIDIA GPUs using `ld_vec<16>` + per-element atomicAdd (see the sketch after this list), and fixes a precedence bug in `fast_gather_kernel_eligible`.
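
A simplified version of that fallback might look like the following. This is a sketch under assumptions, not the PyTorch kernel: it handles contiguous f32 slices with `dim_size % 4 == 0` and 16-byte-aligned rows, uses `float4` as a stand-in for the 16-byte `ld_vec<16>` load, assigns exactly one warp per source slice, and omits tail handling and the multi-slice packing of the real implementation.

```cuda
// Sketch of the pre-sm_90 fallback: 16-byte vectorized loads plus
// per-element atomicAdd, one warp per source slice. Not the PyTorch kernel;
// float4 stands in for ld_vec<16>, and tails (dim_size % 4 != 0) are omitted.
#include <cuda_runtime.h>

__global__ void scatter_add_fallback(float* out, const float* src,
                                     const long* index, long num_slices,
                                     int dim_size) {
  const int lane = threadIdx.x % 32;
  const long warp = (blockIdx.x * (long)blockDim.x + threadIdx.x) / 32;
  if (warp >= num_slices) return;  // warp-per-slice scheduling

  const float4* vsrc = reinterpret_cast<const float4*>(src + warp * dim_size);
  float* dst = out + index[warp] * dim_size;  // row selected by the index
  const int nvec = dim_size / 4;              // 16 bytes per vector load

  for (int v = lane; v < nvec; v += 32) {
    float4 x = vsrc[v];  // single 16-byte load of four elements
    // Accumulate element-wise; atomics stay correct under colliding indices.
    atomicAdd(dst + 4 * v + 0, x.x);
    atomicAdd(dst + 4 * v + 1, x.y);
    atomicAdd(dst + 4 * v + 2, x.z);
    atomicAdd(dst + 4 * v + 3, x.w);
  }
}
```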

Why It Matters

Faster scatter_add accelerates gradient accumulation for sparse parameters such as embeddings, along with other scatter-heavy operations central to large-scale ML training.