PyTorch fixes TF32 heuristics to prevent numerical regressions in max-autotune

PyTorch Releases May 13, 2026

⚡Reverse-engineering cuBLAS from over 2 million shapes to fix TF32 accuracy.

Deep Dive

PyTorch has merged a fix to the ALLOW_TF32 heuristics in its Inductor compiler, addressing numerical regressions that appeared when TensorFloat-32 (TF32) was used with max-autotune mode. The previous heuristics relied on general alignment rules, which did not capture every case where cuBLAS does not actually use TF32, so accuracy could degrade silently for certain workloads. Contributor PaulZhang12 resolved this by reverse-engineering cuBLAS's internal heuristics from more than 2 million tensor shapes and layouts, mapping when TF32 is actually applied. The resulting logic gates TF32 usage accordingly, removing the regression without giving up TF32's performance benefit.
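To see why it matters whether a matmul really runs in TF32, the precision gap can be emulated in pure Python. This is a minimal sketch, not PyTorch or cuBLAS code: the helper names `to_float32`, `to_tf32`, and `dot` are hypothetical, the rounding mode is an assumption (real hardware may round differently), and Python's float64 accumulation stands in for the FP32 accumulator. TF32 keeps float32's 8 exponent bits but only 10 of its 23 mantissa bits.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_tf32(x: float) -> float:
    """Emulate TF32 input precision by keeping 10 of float32's 23 mantissa bits.

    Assumption: ties are rounded up in magnitude; actual hardware may differ.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, then clear the low 13 mantissa bits.
    bits = (bits + 0x1000) & 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def dot(a, b, cast):
    """Dot product with inputs cast to reduced precision before multiplying."""
    return sum(cast(x) * cast(y) for x, y in zip(a, b))


# 1 + 2**-12 is exactly representable in float32 but not in TF32.
a = [1.0 + 2.0 ** -12] * 1024
b = [1.0] * 1024

fp32_result = dot(a, b, to_float32)  # inputs survive the cast unchanged
tf32_result = dot(a, b, to_tf32)     # the 2**-12 term is rounded away
print(fp32_result, tf32_result)
```

The inputs here are chosen so the effect is deterministic: every element loses its small fractional part under the TF32 cast, and the loss accumulates across the 1024 terms.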

Reviewers nmacchioni and eellison approved the fix, and it has been merged into PyTorch's main branch. For developers running PyTorch on NVIDIA GPUs, the change prevents unexpected accuracy drops when TF32 is enabled for mixed-precision training or inference, especially under automatic tuning configurations. Deriving the heuristics from data across millions of shapes and layouts, rather than from general alignment rules, is what makes the new gating more dependable. Users should see more reliable results when combining max-autotune with TF32.
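One way to check a workload for this kind of accuracy drop is to compare a max-autotune matmul against a float64 reference. The sketch below is illustrative only: the function name `max_autotune_matmul_error` and the default shape are assumptions, and it returns `None` when PyTorch or a CUDA device is unavailable. It uses the public `torch.backends.cuda.matmul.allow_tf32` flag and `torch.compile(..., mode="max-autotune")`.

```python
try:
    import torch
except ImportError:  # the sketch degrades to a no-op without PyTorch
    torch = None


def max_autotune_matmul_error(m=1024, n=1024, k=1024):
    """Return the max abs error of a max-autotune matmul vs. a float64 reference.

    Returns None when PyTorch or a CUDA device is not available.
    Note: this sets the global allow_tf32 flag as a side effect.
    """
    if torch is None or not torch.cuda.is_available():
        return None

    # Allow TF32 for float32 matmuls on GPUs that support it.
    torch.backends.cuda.matmul.allow_tf32 = True

    a = torch.randn(m, k, device="cuda")
    b = torch.randn(k, n, device="cuda")

    def mm(x, y):
        return torch.mm(x, y)

    compiled_mm = torch.compile(mm, mode="max-autotune")
    out = compiled_mm(a, b)

    # Float64 reference computed on the same inputs.
    ref = torch.mm(a.double(), b.double())
    return (out.double() - ref).abs().max().item()


error = max_autotune_matmul_error()
print("max abs error:", error)
```

Running the same check with `allow_tf32` set to `False` gives an FP32 baseline to compare against. What counts as an acceptable error depends on the model and on `k`, so no threshold is suggested here.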

Key Points

Previous heuristics missed cuBLAS configurations that skip TF32, causing numerical regressions in max-autotune
New heuristics derived from reverse-engineering cuBLAS using 2 million+ tensor shapes and layouts
Fix prevents silent accuracy degradation while preserving TF32 performance benefits

Why It Matters

Ensures TF32 acceleration doesn't silently degrade model accuracy in PyTorch's max-autotune mode.
