Developer Tools

PyTorch fixes SAC FIFO queue mismatch for inductor compiled code

A single shared queue returned the wrong cached values during recompute, crashing DTensor training.

Deep Dive

PyTorch has patched a subtle bug in its Selective Activation Checkpointing (SAC) system that returned the wrong cached values when `wrap_inductor_compiled_regions=True` was set. The issue, tracked as #175258, arose because all `inductor_compiled_code` Higher Order Operator (HOP) calls shared a single FIFO queue keyed only by the HOP identity. If a compiled region was skipped during SAC's recompute phase (e.g., due to a global cache hit), the queue still handed out its next cached output, and that output belonged to a different region from the one actually executing. As a result, `DTensor.__tensor_unflatten__` failed with a `RuntimeError`: an int32 tensor (slot 1) was consumed in place of a float DTensor (slot 2), and gradient computation crashed.

The root cause was straightforward: SAC's internal cache used `defaultdict(list)` keyed by `func`, so all `inductor_compiled_code` calls collapsed into one list, regardless of which wrapped function they represented. During the forward pass, entries were appended in execution order; during recompute, each region that executed called `list.pop(0)`. When a region was skipped, its `pop` never happened, so the queue was misaligned for every region that followed.

The fix, committed by aorenste, introduces `_sac_storage_key(func, args)` in `torch/utils/checkpoint.py`. For `inductor_compiled_code`, it returns `(func, callable.idx)`, giving each compiled region its own queue; for all other ops, it returns `func` unchanged. Skipping one region during recompute therefore can no longer corrupt another region's FIFO queue. A regression test (`test_sac_cached_value_fifo_mismatch`) was added to `test/dynamo/test_wrap_inductor_compiled_regions.py` to guard against recurrence.
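The queue misalignment and the keyed fix can be modelled in a few lines of plain Python. This is a toy sketch, not PyTorch's code: `Region`, `shared_key`, `per_region_key`, and `run` are hypothetical names invented for illustration, the HOP is represented by a string, and tensors are represented by labels. It assumes two regions, with region 1 skipped during recompute.

```python
from collections import defaultdict

# Stand-in for the inductor_compiled_code HOP (hypothetical; the real key is the op object).
HOP = "inductor_compiled_code"


class Region:
    """Hypothetical stand-in for a compiled region's callable, which carries an idx."""

    def __init__(self, idx, output):
        self.idx = idx
        self.output = output


def shared_key(func, region):
    # Pre-fix behaviour as described above: key by HOP identity only.
    return func


def per_region_key(func, region):
    # Idea behind the fix: include the region's idx so each region gets its own queue.
    return (func, region.idx)


def run(key_fn, regions, skipped_in_recompute):
    storage = defaultdict(list)

    # Forward pass: every region appends its output to the queue chosen by key_fn.
    for region in regions:
        storage[key_fn(HOP, region)].append(region.output)

    # Recompute: skipped regions never pop; executed regions pop the queue head.
    received = {}
    for region in regions:
        if region.idx in skipped_in_recompute:
            continue
        received[region.idx] = storage[key_fn(HOP, region)].pop(0)
    return received


regions = [Region(1, "int32 tensor"), Region(2, "float DTensor")]

buggy = run(shared_key, regions, skipped_in_recompute={1})
fixed = run(per_region_key, regions, skipped_in_recompute={1})

print(buggy)  # {2: 'int32 tensor'}  -> region 2 receives region 1's value
print(fixed)  # {2: 'float DTensor'} -> region 2 receives its own value
```

With the shared key, region 2 pops the entry that region 1 left behind, which mirrors the int32-for-float mix-up described above. With the per-region key, a skipped region simply leaves its own queue untouched.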

Key Points
  • All `inductor_compiled_code` calls shared one FIFO queue keyed only by HOP identity, causing wrong cached values when a region was skipped during SAC recompute.
  • Fix adds `_sac_storage_key()` that creates separate queues per callable by including `callable.idx` as part of the key.
  • Resolves `RuntimeError: Only Tensors of floating point and complex dtype can require gradients` in DTensor training that combines SAC with `wrap_inductor_compiled_regions=True`.

Why It Matters

Restores correct selective activation checkpointing for inductor-compiled regions, so distributed training jobs that use DTensor no longer crash during recompute.