trunk/680f7d7892f965d249bd0c617e1a0908439cf234: [ROCm] disable lowp check for kernel fusion (#175840)
A subtle code change disables fp16 precision checks to stabilize AMD GPU testing in PyTorch's inductor.
AMD engineer anvishwa-amd has landed a fix in PyTorch's main development trunk (commit 680f7d7, PR #175840) that addresses a persistent flake in the framework's CI pipeline for AMD's ROCm platform. The change disables the low-precision (fp16) comparison check in the `test_tield_kernel_fusion` test in PyTorch's inductor, which had been failing intermittently. The failures stemmed not from errors in the core float32 computation but from small numerical differences that arise when the benchmark fusion optimizer selects different kernel fusion strategies between runs; those differences are amplified in half-precision arithmetic. The fix targets the ROCm backend, which powers AMD's data center GPUs such as the MI300, and follows the established pattern of other inductor tests that have disabled low-precision checks because of fusion-related precision tolerances.
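The underlying numerical effect is easy to reproduce without any GPU: in fp16, the same reduction computed in two different orders (as two different fusion strategies might do) yields visibly different results, while the same sums agree closely in double precision. The sketch below is purely illustrative and uses only the Python standard library; `fp16_sum` and the sample data `vals` are hypothetical names, not anything from the PyTorch test suite.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to IEEE half precision and back ('e' format)."""
    return struct.unpack('e', struct.pack('e', x))[0]

def fp16_sum(values) -> float:
    """Accumulate in half precision, rounding after every addition,
    the way an fp16 reduction kernel effectively would."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc

# Hypothetical data: mixing large and small magnitudes exposes the effect.
vals = [1.0, 1e-3, 1e-3, 1e-3, 1e-3] * 100

a = fp16_sum(vals)          # one accumulation order
b = fp16_sum(sorted(vals))  # another order, as a different fusion might use
print(a == b)               # False: fp16 results diverge with the order
print(abs(sum(vals) - sum(sorted(vals))) < 1e-9)  # True: fp64 sums agree
```

This is the kind of run-to-run variance the test was tripping over: the float32 path stays stable, but a strict fp16 comparison is sensitive to which fusion the optimizer happened to pick.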
The adjustment itself is a single setting, `check_lowp=False`, which lets the test pass reliably on ROCm hardware while preserving the primary validation of the float32 computation path. It is a pragmatic trade-off that prioritizes test stability and development velocity over enforcing strict fp16 equivalence, which is inherently noisy under compiler-driven kernel fusion. The fix was verified by running `PYTORCH_TEST_WITH_ROCM=1 python test/inductor/test_benchmark_fusion.py BenchmarkFusionGpuTest.test_tield_kernel_fusion_cuda` on MI300 systems. For the PyTorch and AMD ecosystems, this is a minor but meaningful step in maturing the ROCm software stack: it reduces friction for developers and smooths the integration of performance optimizations like kernel fusion, which is critical for peak AI training and inference speeds on AMD hardware.
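To make the shape of the `check_lowp` pattern concrete, here is a hypothetical miniature of the idea, not PyTorch's actual test harness: the float32/float64 path is always validated, and the half-precision rerun is an optional second check that can be switched off when fusion choices make it flaky. All names here (`check_model`, `sum_in_order`, `sum_sorted`, the tolerances) are illustrative assumptions.

```python
import struct

def to_fp16(x: float) -> float:
    # Round a Python float to IEEE half precision and back.
    return struct.unpack('e', struct.pack('e', x))[0]

def sum_in_order(xs, rnd):
    """Reference reduction: accumulate left to right, rounding with rnd."""
    acc = 0.0
    for x in xs:
        acc = rnd(acc + rnd(x))
    return acc

def sum_sorted(xs, rnd):
    """Stand-in for a 'fused' kernel that reduces in a different order."""
    return sum_in_order(sorted(xs), rnd)

def check_model(ref, fused, inputs, check_lowp=True,
                fp32_tol=1e-6, fp16_tol=1e-3):
    """Hypothetical miniature of the check_lowp pattern: the full-precision
    path is always checked; the low-precision rerun is optional."""
    ident = lambda x: x
    # Primary check: the full-precision results must always agree.
    assert abs(ref(inputs, ident) - fused(inputs, ident)) <= fp32_tol, "fp32 mismatch"
    # Secondary check: skipped when fusion makes fp16 comparison noisy.
    if check_lowp:
        assert abs(ref(inputs, to_fp16) - fused(inputs, to_fp16)) <= fp16_tol, "fp16 mismatch"
    return True

vals = [1.0, 1e-3, 1e-3, 1e-3, 1e-3] * 100
check_model(sum_in_order, sum_sorted, vals, check_lowp=False)   # passes
# check_model(sum_in_order, sum_sorted, vals, check_lowp=True)  # AssertionError
```

With `check_lowp=False` the comparison above passes, because the double-precision sums agree to well within tolerance; enabling the fp16 rerun fails on exactly the order-dependent rounding the PR describes.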
- AMD engineer disabled fp16 checks in PyTorch's `test_tield_kernel_fusion` to fix ROCm CI flakiness (PR #175840).
- Failures were caused by small numerical differences in kernel fusion strategies, amplified in half-precision comparisons.
- Fix follows existing patterns from tests like `test_cumsum_zero_dim` and was validated on MI300 GPUs.
Why It Matters
Stabilizes testing for AMD's AI hardware platform, removing a blocker for developers optimizing PyTorch performance on ROCm.