Developer Tools

PyTorch PR #183334 fixes Triton kernel eager_input_vals propagation bug

A new fix ensures correct metadata propagation for Triton kernels when decomposition changes node counts.

Deep Dive

PR #183334, now merged into PyTorch, fixes a metadata bug in the Triton kernel compilation pipeline. The bug stemmed from the `replace_by_example` function, which assumed that the eager trace and the value (val) trace would produce replacement graphs with the same number of nodes. That assumption breaks when the decomposition for `triton_kernel_wrapper_functional` clones an `as_strided` view that has a dynamic size-1 leading dimension: the node counts no longer match, and `eager_input_vals` is propagated incorrectly.
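The failure mode can be illustrated with a toy model. This is a minimal sketch and not PyTorch's code: `ToyNode`, `propagate_by_position`, and the placeholder metadata strings are hypothetical, and it assumes the extra `clone` node appears in the val trace only. It shows how pairing nodes by position goes wrong as soon as one trace gains a node.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node; not torch.fx.Node."""
    op: str
    meta: dict = field(default_factory=dict)


def propagate_by_position(eager_nodes, val_nodes):
    """Copy metadata pairwise, assuming both traces line up one-to-one."""
    for eager, val in zip(eager_nodes, val_nodes):
        val.meta["eager_input_vals"] = eager.meta.get("eager_input_vals")


# Assumed eager trace: two nodes.
eager_trace = [
    ToyNode("as_strided", {"eager_input_vals": "view-args"}),
    ToyNode("triton_kernel_wrapper_mutation", {"eager_input_vals": "kernel-args"}),
]

# Assumed val trace: the same ops plus an extra clone of the view.
val_trace = [
    ToyNode("as_strided"),
    ToyNode("clone"),
    ToyNode("triton_kernel_wrapper_mutation"),
]

propagate_by_position(eager_trace, val_trace)

# The kernel's metadata lands on the clone, and the kernel node gets none.
for node in val_trace:
    print(node.op, node.meta)
```

In this toy, `zip` stops at the shorter list without raising, so the mismatch produces wrong metadata rather than an error.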

The fix modifies `replace_by_example` to return the inserted replacement nodes, so a decomposition can attach metadata such as `eager_input_vals` directly where it is needed. The change is applied only to the `triton_kernel_wrapper_functional` path, which now propagates `eager_input_vals` directly to the inserted `triton_kernel_wrapper_mutation` node. The existing generic propagation for `auto_functionalized` paths is left unchanged, because the bug could not be reproduced for that case and the path was not understood well enough to justify modifying it. The PR also includes a regression test for the Triton failure case and a direct test for the `replace_by_example` return value.
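The return-the-inserted-nodes pattern can be sketched in the same toy style. Everything here is hypothetical (`ToyNode`, `toy_replace`, the list-based graph, the placeholder metadata) and does not reflect the real `replace_by_example` signature; it only shows why handing the new nodes back to the caller removes the need to match nodes by position.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Hypothetical stand-in for a graph node; not torch.fx.Node."""
    op: str
    meta: dict = field(default_factory=dict)


def toy_replace(graph, index, replacement_ops):
    """Replace graph[index] with new nodes and return the inserted nodes."""
    inserted = [ToyNode(op) for op in replacement_ops]
    graph[index:index + 1] = inserted
    return inserted


graph = [
    ToyNode("placeholder"),
    ToyNode("triton_kernel_wrapper_functional", {"eager_input_vals": "kernel-args"}),
    ToyNode("output"),
]
original = graph[1]

# The decomposition may emit any number of nodes; the caller does not care.
inserted = toy_replace(
    graph, 1, ["as_strided", "clone", "triton_kernel_wrapper_mutation"]
)

# Attach metadata to the specific node that needs it, found by op name.
for node in inserted:
    if node.op == "triton_kernel_wrapper_mutation":
        node.meta["eager_input_vals"] = original.meta["eager_input_vals"]

print([(n.op, n.meta) for n in graph])
```

Because the caller selects the target node by what it is rather than where it sits, an extra `clone` in the replacement no longer shifts the metadata onto the wrong node.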

Key Points
  • Fixes a bug where `replace_by_example` assumed equal node counts between eager and val traces, failing for Triton kernel decompositions with dynamic size-1 leading dimensions.
  • The fix returns inserted replacement nodes so decompositions can attach `eager_input_vals` metadata explicitly, applied only to `triton_kernel_wrapper_functional`.
  • Includes regression tests for Triton and direct tests for `replace_by_example` return values, authored with Claude and Codex.

Why It Matters

The fix keeps `eager_input_vals` metadata correct when PyTorch compiles Triton kernels under dynamic shapes, preventing silent errors in production ML workloads that hit this case.