PyTorch fixes TF32 heuristics to prevent numerical regressions in max-autotune
Reverse-engineering cuBLAS from over 2 million shapes to fix TF32 accuracy.
PyTorch has merged a fix to the ALLOW_TF32 heuristics in its Inductor compiler, addressing numerical regressions that occurred when TensorFloat32 (TF32) was enabled in max-autotune mode. The previous heuristics relied on general alignment rules that did not capture every case in which cuBLAS does not actually use TF32, so accuracy was silently degraded for the affected workloads. Contributor PaulZhang12 resolved this by reverse-engineering cuBLAS's internal heuristics from over 2 million tensor shapes and layouts, mapping when TF32 is actually applied. The resulting logic gates TF32 usage accordingly, eliminating the regression without sacrificing performance.
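To give a sense of why a TF32 mismatch shows up as a numerical regression, here is a minimal pure-Python sketch that emulates TF32 input rounding (10 explicit mantissa bits instead of FP32's 23) and compares dot-product error against FP32 inputs. It is an illustration only, not the PyTorch or cuBLAS implementation: the helper names `to_float32`, `to_tf32`, and `dot_error` are hypothetical, round-to-nearest-even is an assumption of this sketch, and accumulation is done in Python floats for simplicity.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 by keeping only 10 of float32's 23 mantissa bits.

    Assumption: round-to-nearest-even on the 13 dropped bits.
    Inf/NaN inputs are not handled specially in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lsb = (bits >> 13) & 1                      # lowest bit that survives
    bits = (bits + 0x0FFF + lsb) & 0xFFFFE000   # round, then clear 13 low bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot_error(xs, ys, cast) -> float:
    """Absolute error of a dot product whose inputs are rounded by `cast`,
    measured against the same dot product on unrounded float64 inputs."""
    exact = sum(x * y for x, y in zip(xs, ys))
    approx = sum(cast(x) * cast(y) for x, y in zip(xs, ys))
    return abs(approx - exact)


if __name__ == "__main__":
    xs = [1.0 / 3.0] * 8
    ys = [1.0 / 3.0] * 8
    print("fp32-input error:", dot_error(xs, ys, to_float32))
    print("tf32-input error:", dot_error(xs, ys, to_tf32))
```

The TF32 error is orders of magnitude larger than the FP32 error for the same inputs, which is why enabling TF32 on a shape where the cuBLAS path stays in full FP32 can change results noticeably.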
The fix was approved by reviewers nmacchioni and eellison and merged into PyTorch's main branch. For developers using PyTorch on NVIDIA GPUs, the change prevents unexpected accuracy drops when TF32 is enabled for training or inference, especially under automatic tuning configurations. The data-driven reverse engineering also demonstrates a rigorous way to maintain numerical safety in high-performance compilers. Users should see more consistent results when running max-autotune with TF32 enabled.
- Previous heuristics missed cuBLAS configurations that skip TF32, causing numerical regressions in max-autotune
- New heuristics derived from reverse-engineering cuBLAS using 2 million+ tensor shapes and layouts
- Fix prevents silent accuracy degradation while preserving TF32 performance benefits
Why It Matters
Ensures TF32 acceleration doesn't silently degrade model accuracy in PyTorch's max-autotune mode.