trunk/50294ed45005ceb3b8669c68b408236e3f06d6b1: [MPS] Flatten 5D tensors to 4D in batch_norm for performance (#180335)
A clever tensor reshape trick cuts BatchNorm3d runtime from 8.7ms to 3.5ms on M4 Pro chips.
The PyTorch development team has merged a performance optimization (commit 50294ed) for the Metal Performance Shaders (MPS) backend on Apple Silicon. The change addresses a significant slowdown in BatchNorm3d operations, which process 5D tensors of shape [N, C, D, H, W] for 3D data such as volumetric medical scans or video. The core issue was that MPSGraph's normalization operators run slower on rank-5 tensors than on rank-4 tensors, even when the element counts are identical. The solution is simple: before calling the underlying MPS kernels, the code now merges the depth and height dimensions, reshaping the tensor from [N, C, D, H, W] to [N, C, D*H, W]. Batch normalization computes its statistics per channel over the batch and all spatial positions, so a reshape that leaves N and C untouched does not change the result. The operation then recurses into the existing, highly optimized 4D processing path.
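The reshape described above can be sketched at the Python level. This is only an illustration, not the commit's code: the actual change lives inside the MPS backend, the helper name `batch_norm_via_4d` is hypothetical, and the example runs on CPU with a small made-up shape so it works on any machine.

```python
import torch
import torch.nn.functional as F


def batch_norm_via_4d(x, running_mean, running_var, weight, bias,
                      training=True, momentum=0.1, eps=1e-5):
    """Hypothetical helper: route a 5D batch norm through the 4D path."""
    n, c, d, h, w = x.shape
    # Merge D and H. N and C are untouched, so each channel still
    # normalizes over exactly the same set of elements.
    x4 = x.reshape(n, c, d * h, w)
    y4 = F.batch_norm(x4, running_mean, running_var, weight, bias,
                      training, momentum, eps)
    # Restore the original 5D shape for the caller.
    return y4.reshape(n, c, d, h, w)


torch.manual_seed(0)
x = torch.randn(2, 3, 4, 5, 6)          # [N, C, D, H, W], illustrative size
weight, bias = torch.randn(3), torch.randn(3)

# Separate running-stat buffers, because training mode updates them in place.
rm_ref, rv_ref = torch.zeros(3), torch.ones(3)
rm_new, rv_new = torch.zeros(3), torch.ones(3)

ref = F.batch_norm(x, rm_ref, rv_ref, weight, bias, True, 0.1, 1e-5)
out = batch_norm_via_4d(x, rm_new, rv_new, weight, bias)

max_diff = (ref - out).abs().max().item()
print(f"max abs difference between 5D and flattened 4D: {max_diff:.3e}")
```

On a Mac, moving the tensors to `device="mps"` would exercise the backend itself, although after this commit both calls are expected to take the same 4D route internally.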
Benchmarks on an Apple M4 Pro with a tensor shape of [4, 64, 64, 64, 64] show a substantial gain. The combined forward and backward pass for `nn.BatchNorm3d` dropped from 8.7 milliseconds to 3.5 milliseconds, a 2.4x speedup. The native `batch_norm` forward and backward functions now also match the performance of their 4D (BatchNorm2d) counterparts. The team verified correctness across all existing MPS batch normalization tests, with a maximum numerical difference of 2.38e-07 from the original 5D computation. That is about twice float32 machine epsilon (1.19e-07), which puts it at the level of ordinary rounding error. The optimization directly benefits developers working in 3D computer vision, scientific computing, and AI for healthcare on Mac platforms.
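To put the reported maximum difference in context, it can be compared against float32 machine epsilon with a few lines of arithmetic. This is a small sketch using only the figure quoted above; the variable names are illustrative.

```python
# float32 stores a 24-bit significand (23 explicit bits), so the gap
# between 1.0 and the next representable value is 2**-23.
FLOAT32_EPS = 2.0 ** -23

# Maximum difference reported in the text above.
reported_max_diff = 2.38e-07

ratio = reported_max_diff / FLOAT32_EPS
print(f"float32 machine epsilon: {FLOAT32_EPS:.3e}")
print(f"reported difference in units of epsilon: {ratio:.2f}")
```

The reported figure comes out at roughly two units of epsilon, matching 2**-22 to the quoted precision, which is the kind of discrepancy a change in summation order can produce.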
- Performance Boost: Combined forward/backward for BatchNorm3d is 2.4x faster, from 8.7ms to 3.5ms on M4 Pro.
- Technical Fix: Flattens 5D tensors ([N,C,D,H,W]) to 4D ([N,C,D*H,W]) to use faster MPS 4D kernels.
- Maintains Accuracy: Max numerical difference is 2.38e-07, passing all existing PyTorch MPS tests.
Why It Matters
This speeds up the BatchNorm3d layers used in 3D model training on Apple Silicon, benefiting fields like medical imaging and video analysis where BatchNorm3d is common.