Developer Tools

viable/strict/1773837300: Use FloorDiv and Mod instead of // and % on sympy exprs (#177051)

A subtle bug in PyTorch's inductor compiler was forcing unnecessary masking in Triton kernels, causing major slowdowns.

Deep Dive

The PyTorch team has resolved a significant performance regression in TorchInductor, the compiler behind `torch.compile`. The core issue was a type mismatch: when compiler code applied Python's native `//` (floor division) and `%` (modulo) operators to raw SymPy symbolic expressions, it produced `sympy.floor` and `sympy.Mod` objects. These differ from PyTorch's custom `torch.utils._sympy.functions.FloorDiv` and `Mod` types, which are the forms the compiler's symbolic reasoning engine records axioms against (for example, divisibility facts asserted via `torch._check`).
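The mismatch is easy to see in isolation. The snippet below is a minimal sketch, assuming a PyTorch build where `torch.utils._sympy.functions` exposes `FloorDiv` and `Mod`; the symbol name `s` is purely illustrative:

```python
import sympy
from torch.utils._sympy.functions import FloorDiv, Mod

s = sympy.Symbol("s", integer=True, positive=True)

# Python's operators on a raw sympy expression build sympy's own node types.
print(type(s // 128).__name__)  # floor  (sympy.floor of s/128)
print(type(s % 128).__name__)   # Mod    (sympy.Mod)

# The explicit functions build the types Inductor's reasoning engine expects.
print(type(FloorDiv(s, 128)).__name__)  # FloorDiv
print(type(Mod(s, 128)).__name__)       # Mod (torch.utils._sympy.functions.Mod)
```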

Because of this mismatch, the `statically_known_true` function failed to recognize valid divisibility constraints (such as `seq_len % 128 == 0`). The compiler therefore could not prove shapes were divisible and fell back to generating conservative, masked Triton kernels for GPU operations, with severe performance penalties: most notably a 2.5x slowdown in the `flex_attention` operator with dynamic shapes. After auditing 917 raw operator occurrences across 444 Python files, developers identified and fixed 16 problematic call sites in 8 files by replacing `//` and `%` with explicit calls to `FloorDiv()` and `Mod()`.
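The fix follows the same simple pattern at each call site. The helper below is a hypothetical illustration of that pattern, not code taken from the pull request itself:

```python
from torch.utils._sympy.functions import FloorDiv, Mod

def split_into_blocks(numel, block):
    """Hypothetical example of the call-site pattern changed by the fix."""
    # Before: Python operators on sympy exprs created sympy.floor / sympy.Mod,
    # which the divisibility axioms recorded for FloorDiv / Mod did not match.
    #   n_blocks  = numel // block
    #   remainder = numel % block

    # After: explicit FloorDiv / Mod keep the expression in the types that
    # statically_known_true and the symbolic reasoning engine can match.
    n_blocks = FloorDiv(numel, block)
    remainder = Mod(numel, block)
    return n_blocks, remainder
```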

The impact is immediate and substantial. In a benchmark for `flex_attention` with a sequence length (S) of 2048 on a GB200 system, performance with dynamic shapes is restored from a regressed 1.07 ms back to the optimal 0.43 ms, matching the performance of static shapes. This fix also resolves related vectorization regressions, such as a 27% slowdown in RMSNorm on H100 GPUs that was addressed in a prior pull request (#175755). The correction ensures the compiler can correctly leverage user-provided shape constraints to generate the most efficient, unmasked GPU kernels possible.
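For model authors, the practical upshot is that divisibility hints now flow through as intended. The snippet below is a minimal sketch of conveying such a constraint with `torch._check` around a `flex_attention` call under dynamic shapes; the tensor shapes, the block size of 128, and the assumption of a CUDA device are illustrative, not prescribed by the pull request:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

@torch.compile(dynamic=True)
def attend(q, k, v):
    # Assert that the (dynamic) sequence length is a multiple of the tile size.
    # With the fix, the compiler can match this fact against its FloorDiv/Mod
    # expressions and emit unmasked Triton kernels.
    torch._check(q.shape[-2] % 128 == 0)
    return flex_attention(q, k, v)

q = k = v = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
out = attend(q, k, v)
```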

Key Points
  • Fixed a type mismatch where Python's `//` and `%` created incorrect SymPy types, breaking the compiler's symbolic reasoning.
  • Eliminated a 2.5x performance regression in `flex_attention` with dynamic shapes, restoring runtime from 1.07 ms to 0.43 ms.
  • The fix required auditing 917 operator occurrences across 444 files, with changes made to 16 critical call sites in 8 files.

Why It Matters

This fix ensures PyTorch's `torch.compile` can generate optimal GPU code, preventing major, hidden performance regressions for models using dynamic shapes and attention.