PyTorch fixes SAC FIFO queue mismatch for inductor compiled code

PyTorch Releases May 11, 2026

⚡ A single shared queue returned wrong cached values, crashing DTensor training.

Deep Dive

PyTorch has patched a subtle bug in its Selective Activation Checkpointing (SAC) system that returned wrong cached values when `wrap_inductor_compiled_regions=True` was set. The issue, tracked as #175258, arose because all `inductor_compiled_code` Higher Order Operator (HOP) calls shared a single FIFO queue keyed only by the HOP identity. During the SAC recompute phase, if a compiled region was skipped (e.g., due to a global cache hit), the next region to run popped the head of the queue, which was the output cached for the skipped region rather than its own. As a result, an int32 tensor (slot 1) was consumed in place of a float DTensor (slot 2), and `DTensor.__tensor_unflatten__` failed with a `RuntimeError` that crashed gradient computation.
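The misalignment described above can be reproduced with a toy model in plain Python. This is only an illustrative sketch, not PyTorch's code: the `forward` and `recompute` helpers, the region names, and the string stand-ins for tensors are all hypothetical.

```python
from collections import defaultdict

# Stand-in for the HOP identity; every compiled region shares this one key.
HOP = "inductor_compiled_code"

def forward(storage, regions):
    """Forward pass: append each region's output to the single shared queue."""
    for _name, output in regions:
        storage[HOP].append(output)

def recompute(storage, regions, skipped):
    """Recompute pass: each region that actually runs pops the queue head."""
    seen = {}
    for name, _output in regions:
        if name in skipped:
            continue  # skipped region: its pop never happens
        seen[name] = storage[HOP].pop(0)
    return seen

regions = [("region_1", "int32 tensor"), ("region_2", "float DTensor")]
storage = defaultdict(list)
forward(storage, regions)

# region_1 is skipped, so region_2 pops the value cached for region_1.
result = recompute(storage, regions, skipped={"region_1"})
print(result)
```

In this toy run, `region_2` receives the entry that was stored for `region_1`, which mirrors the int32-for-float substitution reported in the issue.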

The root cause was straightforward: SAC's internal cache used `defaultdict(list)` keyed by `func`, so all `inductor_compiled_code` calls collapsed into one list, regardless of which wrapped function they represented. During the forward pass, entries were appended in order; during recompute, `list.pop(0)` was called once per region execution. When a region was skipped, its `pop` never happened, so the queue was misaligned for every region that followed. The fix, committed by aorenste, introduces `_sac_storage_key(func, args)` in `torch/utils/checkpoint.py`. For `inductor_compiled_code`, it returns `(func, callable.idx)`, giving each compiled region its own queue; for all other ops, it returns the original `func`. Skipping one region during recompute therefore no longer corrupts another region's FIFO queue. A regression test (`test_sac_cached_value_fifo_mismatch`) was added to `test/dynamo/test_wrap_inductor_compiled_regions.py` to guard against regressions.
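The keying scheme can be sketched in a few lines of plain Python. This is a simplified model under stated assumptions, not the code in `torch/utils/checkpoint.py`: `FakeCallable`, `sac_storage_key`, and `run` are hypothetical names, and treating the callable as the first element of `args` is an assumption made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

INDUCTOR_HOP = "inductor_compiled_code"  # stand-in for the HOP identity

@dataclass(frozen=True)
class FakeCallable:
    """Hypothetical stand-in for a compiled-region callable with an idx."""
    idx: int

def sac_storage_key(func, args):
    """Per-region key for the inductor HOP, plain func for everything else.

    Assumption: the callable is the first positional argument.
    """
    if func == INDUCTOR_HOP:
        return (func, args[0].idx)
    return func

def run(skipped_idx):
    storage = defaultdict(list)
    calls = [
        (INDUCTOR_HOP, (FakeCallable(1),), "int32 tensor"),
        (INDUCTOR_HOP, (FakeCallable(2),), "float DTensor"),
    ]
    # Forward pass: each region appends to its own queue.
    for func, args, output in calls:
        storage[sac_storage_key(func, args)].append(output)
    # Recompute pass: a skipped region leaves the other queues untouched.
    recomputed = {}
    for func, args, _output in calls:
        if args[0].idx in skipped_idx:
            continue
        recomputed[args[0].idx] = storage[sac_storage_key(func, args)].pop(0)
    return recomputed

recomputed = run(skipped_idx={1})
print(recomputed)
```

With per-region keys, region 2 pops from its own queue even though region 1 was skipped, so it gets back the value it stored during the forward pass. Ops other than the inductor HOP keep their original key, so their behavior is unchanged in this model.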

Key Points

All `inductor_compiled_code` calls shared one FIFO queue keyed only by HOP identity, causing wrong cached values when a region was skipped during SAC recompute.
Fix adds `_sac_storage_key()` that creates separate queues per callable by including `callable.idx` as part of the key.
Resolves `RuntimeError: Only Tensors of floating point and complex dtype can require gradients` in DTensor training with gradient checkpointing and compiled mode.

Why It Matters

Enables correct gradient checkpointing in compiled PyTorch, preventing crashes in distributed training workflows.