Developer Tools

trunk/877ba3c498e471b4a3199943e3febfaf9f7cb77f: [dynamo] Fix tensorify recompiles for method-form SymFloat ops (#179395)

PyTorch patch stops recompilation on every new scalar value for backed float inputs

Deep Dive

PyTorch has merged a fix to address a common recompilation headache in Dynamo, the JIT compiler frontend behind `torch.compile`. The issue (tracked as #169634) occurred when backed float inputs (Python float scalars that Dynamo tracks with backed symbols rather than baking in their values) were not fully tensorified during the joint FX pass. Two patterns were left behind: method-form in-place tensor ops like `Tensor.mul_()` and `Tensor.div_()`, and composite SymFloat expressions built from Python operators such as `0.5 * lr` or `beta2 ** 0.5`. When the later specialization sweep encountered these untouched patterns, it concluded that the backed floats had not been tensorified and restarted analysis with those values specialized. That meant every new scalar value triggered a costly recompile, breaking the dynamic-shapes machinery that normally keeps such scalars symbolic.
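
A minimal sketch of the kind of code that hit this: an optimizer-style step that mutates a parameter in place using Python float hyperparameters. The function and values here are illustrative rather than taken from the patch, and whether floats actually stay dynamic also depends on Dynamo's float-specialization settings in a given PyTorch version.

```python
import torch

# Optimizer-style update driven by Python float scalars. The method-form
# in-place op (Tensor.mul_) and the composite expression `0.5 * lr` are the
# two patterns the patch targets: before the fix they could escape
# tensorification, so Dynamo specialized on the scalar values and recompiled
# for every new lr.
@torch.compile
def step(param, grad, lr, beta):
    param.mul_(beta)                   # method-form in-place op
    param.add_(grad, alpha=0.5 * lr)   # composite SymFloat expression
    return param

p = torch.ones(4)
g = torch.ones(4)
for lr in (1e-1, 1e-2, 1e-3):  # each new scalar value used to force a recompile
    step(p, g, lr, 0.9)
```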

The fix, authored by @bobrenjc93 and approved by @laithsakka, teaches `tensorify_python_scalars` to handle these cases directly. It now recognizes method-form ops and maps them to their corresponding in-place ATen overloads (e.g., `mul_` → `aten.mul_.Tensor`), preserving optimizer-style mutation without inserting an extra `copy_` node. It also detects composite SymFloat expressions built from Python operators and, when tensorification succeeds, records the underlying backed float symbols rather than just the composite expression node. This keeps backed floats dynamic until the joint FX pass can lower them into tensor compute, avoiding the specialization trap. The result: stable compiled optimizer kernel counts and no more recompiles on every new scalar value in training loops.
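
To illustrate the first half of the change, here is a hypothetical FX-graph rewrite in the spirit of what the pass does: replace a method-form in-place call with its in-place ATen overload so the mutation is preserved without an extra `copy_`. The mapping table and helper below are assumptions for illustration only, not the actual `tensorify_python_scalars` implementation; only the FX graph APIs and ATen overloads used are standard.

```python
import torch
from torch.fx import Graph, Node

# Hypothetical mapping from method-form in-place ops to in-place ATen
# overloads (mirrors the `mul_` -> `aten.mul_.Tensor` example above).
METHOD_TO_ATEN_INPLACE = {
    "mul_": torch.ops.aten.mul_.Tensor,
    "add_": torch.ops.aten.add_.Tensor,
    "sub_": torch.ops.aten.sub_.Tensor,
    "div_": torch.ops.aten.div_.Tensor,
}

def rewrite_inplace_method(graph: Graph, node: Node) -> Node:
    """Rewrite a call_method node like `x.mul_(s)` into the in-place ATen
    overload, assuming the scalar argument has already been tensorified."""
    if node.op != "call_method" or node.target not in METHOD_TO_ATEN_INPLACE:
        return node
    with graph.inserting_after(node):
        new_node = graph.call_function(
            METHOD_TO_ATEN_INPLACE[node.target], node.args, node.kwargs
        )
    node.replace_all_uses_with(new_node)
    graph.erase_node(node)
    return new_node
```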

Key Points
  • Fix targets backed float inputs that caused recompiles on every new scalar value (issue #169634).
  • Extends tensorification to method-form ops (mul_, add_, sub_, div_) and composite SymFloat expressions from Python operators.
  • Uses in-place ATen overloads to avoid extra copy_ nodes, keeping compiled optimizer kernel counts stable (a quick way to check for recompiles is sketched below).
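
One hedged way to check this yourself is to watch Dynamo's recompile log while varying only the scalar. `TORCH_LOGS="recompiles"` is the standard logging switch; the function and file name below are illustrative, and on fixed versions (with floats left dynamic) the loop should not log a new recompile reason for each lr.

```python
# Run as: TORCH_LOGS="recompiles" python check_recompiles.py
import torch

@torch.compile
def decay(x, lr):
    # Method-form in-place op fed by a composite Python float expression.
    return x.mul_(1.0 - lr)

x = torch.ones(8)
for lr in (1e-1, 1e-2, 1e-3):
    decay(x, lr)  # with the fix, changing lr alone should not recompile
```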

Why It Matters

ML engineers get faster training loops with fewer recompiles, especially when dynamic shapes are combined with Python-float optimizer hyperparameters such as learning rates and betas.