trunk/10a41bb26262d4685381108be2b2e138c3e83998: [MPS] Replace sum/nansum/mean ops wth native Metal kernel (#180709)
New Metal shaders replace MPSGraph for sum/nansum/mean ops, delivering 2-4x speedups on Apple Silicon.
The PyTorch team has merged a significant performance optimization (#180709) that replaces the MPSGraph implementations of the tensor reduction operations torch.sum, torch.nansum, and torch.mean with custom, low-level Metal shader kernels. The change routes these common operations through the same dispatch path used by the CPU and CUDA backends, retiring the separate wrapper functions. At its core is a kernel design with four specialized execution paths, selected at dispatch time from the tensor's shape and the reduction dimension: a two-pass multi-threadgroup scheme for large scalar (full) reductions, optimized outer- and inner-dimension reductions using coalesced memory access and SIMD-group operations, and a general fallback for the remaining cases.
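The path-selection logic described above can be sketched in Python. This is an illustrative model only: the function name, the thresholds, and the path labels are assumptions, not the actual dispatch code from the PR, which lives in C++/Metal.

```python
# Hypothetical sketch of the four-way kernel-path selection; names and
# the max_single_pass threshold are illustrative, not PyTorch internals.
def select_reduction_path(shape, dim, max_single_pass=65536):
    """Pick one of four execution strategies for a reduction kernel."""
    numel = 1
    for s in shape:
        numel *= s
    if dim is None:
        # Full reduction to a scalar: large inputs use a two-pass
        # multi-threadgroup scheme (partial sums, then a final pass).
        return "two_pass_scalar" if numel > max_single_pass else "general"
    if dim == 0:
        # Reducing the outer dimension: adjacent threads read adjacent
        # memory locations (coalesced access).
        return "outer_dim"
    if dim == len(shape) - 1:
        # Reducing the innermost dimension: each row is reduced
        # cooperatively with SIMD-group operations.
        return "inner_dim"
    return "general"  # fallback for middle dims and odd layouts
```

For example, the benchmarked dim=0 reduction on a (1024, 1024) tensor would take the coalesced outer-dimension path under this sketch.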
Benchmark results on an M4 Max chip show substantial speedups, with dimension reductions improving 2-4x across data types. For instance, a dim=0 reduction on a (1024, 1024) int64 tensor is now 3.16x faster, dropping from 23.4µs to 7.4µs. Full reductions also improved, with a 2.28x speedup for int64 operations. The update adds correctness tests covering numerical stability for low-precision types such as fp16 and bfloat16, and guards against memory-aliasing bugs that could produce incorrect results in repeated operations. This optimization directly benefits developers running PyTorch models on Macs, making common tensor reductions significantly more efficient.
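The nansum semantics the new kernels must preserve are easy to state as a tiny pure-Python reference: NaN elements are treated as zero rather than propagating through the sum. The helper below is a reference model for illustration, not PyTorch code.

```python
import math

def nansum_ref(values):
    """Reference semantics for a nansum reduction: NaNs count as zero."""
    return sum(0.0 if (isinstance(v, float) and math.isnan(v)) else v
               for v in values)
```

A plain sum over the same input would return NaN as soon as one element is NaN, which is why nansum needs its own kernel path rather than reusing sum unchanged.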
- Replaces MPSGraph with custom Metal kernels for sum/nansum/mean ops, matching CPU/CUDA dispatch paths
- Delivers up to 3.16x speedup for dimension reductions and 2.28x for full reductions on M4 Max
- Introduces four specialized kernel paths optimized for different tensor shapes and reduction scenarios
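The two-pass multi-threadgroup scheme used for large full reductions can be mimicked in a few lines of Python: pass one has each "threadgroup" reduce its own chunk to a partial sum, and pass two reduces the partials. The chunking and group size here are illustrative, not the actual Metal kernel's launch configuration.

```python
# Illustrative two-pass reduction mirroring the multi-threadgroup scheme:
# pass 1 reduces chunks in parallel (simulated sequentially here),
# pass 2 reduces the partial sums in a second kernel launch.
def two_pass_sum(data, group_size=4):
    # Pass 1: each simulated threadgroup reduces its chunk independently.
    partials = [sum(data[i:i + group_size])
                for i in range(0, len(data), group_size)]
    # Pass 2: a single final pass reduces the partial sums to a scalar.
    return sum(partials)
```

Splitting the work this way lets many threadgroups run concurrently on the GPU without needing global synchronization inside a single kernel launch.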
Why It Matters
Faster tensor operations accelerate AI model training and inference workflows on Apple Silicon Macs.