trunk/517a68542cfc68a865998c678d25b271a713417a: [ROCm][CI] fix batch_norm decomp for negative running_var (#177665)
A subtle bug in PyTorch's batch normalization was causing widespread NaN errors on AMD's MI300X AI accelerators.
The PyTorch development team has patched a significant bug in the framework's ROCm support that was causing widespread failures for AI models running on AMD's latest MI300X accelerators. The issue, documented as pull request #177665, stemmed from how PyTorch's just-in-time (JIT) compiler decomposed batch normalization operations when they were fused with ReLU activations. Specifically, when the running variance buffer contained negative values (mathematically impossible for a true variance, but easy to produce with initialization methods like torch.randn), the compiled code ended up taking the square root of a negative number, yielding NaN (not-a-number) outputs that corrupted entire training runs.
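To make the failure mode concrete, here is a minimal sketch in plain PyTorch. It is not the PR's decomposition code, and the tensor shapes and eps value are arbitrary assumptions; it only mimics the inference-mode normalization step, where rsqrt(running_var + eps) turns a negative variance entry into a NaN that a downstream ReLU cannot mask.

```python
import torch

# Illustrative sketch (not the PR's code): the inference-mode batch_norm
# decomposition normalizes with rsqrt(running_var + eps). A negative entry in
# running_var makes that intermediate NaN, and a fused ReLU propagates the NaN
# rather than clamping it away.
eps = 1e-5
x = torch.randn(8, 4)                     # (batch, channels); values are illustrative
running_mean = torch.zeros(4)
running_var = torch.randn(4)              # randn can produce negative "variance" entries

inv_std = torch.rsqrt(running_var + eps)  # NaN wherever running_var + eps < 0
y = torch.relu((x - running_mean) * inv_std)
print(torch.isnan(y).any())               # tensor(True) whenever any variance entry is negative
```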
Testing revealed the bug caused 31.2% tensor mismatches (25,600 out of 81,920 elements) and complete test failures on systems with ROCm version 7.2.26015. The problem only manifested when compiler fusion was enabled for batch normalization followed by ReLU operations, creating a subtle but devastating interaction. Engineers traced the issue to the mathematical decomposition used during JIT compilation and implemented two fixes: modifying the batch_norm decomposition to properly handle negative running variance edge cases, and updating test initialization to ensure non-negative variance values.
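The exact diff is not reproduced here, but the two changes described above can be sketched as follows. The helper name, the clamping approach, and the eps value are assumptions for illustration, not the PR's actual code.

```python
import torch

# Hedged sketch of both mitigations; the real decomposition change may differ in detail.
eps = 1e-5

# 1) Decomposition side: guard the variance so rsqrt never sees a negative input.
#    (safe_inv_std is a hypothetical helper, not a PyTorch API.)
def safe_inv_std(running_var: torch.Tensor, eps: float) -> torch.Tensor:
    # Clamping to zero is one way to keep rsqrt(var + eps) finite for ill-formed buffers.
    return torch.rsqrt(torch.clamp(running_var, min=0.0) + eps)

# 2) Test side: initialize running_var with guaranteed non-negative values
#    instead of torch.randn, so the buffer is a valid variance to begin with.
running_var = torch.rand(4)  # uniform in [0, 1)

print(safe_inv_std(torch.tensor([-0.3, 0.0, 1.0, 2.0]), eps))  # finite for every entry
```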
This fix is particularly crucial for the AI hardware ecosystem as AMD pushes its MI300X accelerators as competitive alternatives to NVIDIA's dominant GPUs. Stability in core operations like batch normalization—a fundamental component in nearly all modern neural networks—is essential for researchers and companies deploying production AI systems. The patch demonstrates PyTorch's ongoing commitment to multi-vendor hardware support while highlighting the complex interactions between mathematical correctness and compiler optimizations in deep learning frameworks.
- Bug caused 31.2% tensor mismatches and NaN outputs when batch normalization fused with ReLU on AMD GPUs
- Root cause was negative running_var values leading to sqrt(-x) during JIT compiler optimization passes
- Fix modifies mathematical decomposition and test initialization to ensure stability on MI300X with ROCm 7.2
Why It Matters
Ensures stable AI training on AMD's competitive MI300X accelerators, maintaining hardware diversity in the AI ecosystem.