Developer Tools

trunk/f8009afe078d8cd8fb78cb00ed2c4afd7415f388: Fix unbounded DTensor sharding propagation cache growth (#178301)

A bug causing unbounded memory growth during distributed AI training has been patched in PyTorch's core.

Deep Dive

Meta's PyTorch team has resolved a significant performance bug (PR #178301) in the framework's distributed tensor (DTensor) system. The bug lived in the C++ DTensor dispatch fast path, specifically in the NativeShardingPropagatorCache. This cache was incorrectly hashing non-tensor items inside list arguments, such as the scalar values passed to optimizer operations, into its cache key. For foreach optimizer operations like `_foreach_div_.ScalarList` and `_foreach_addcdiv_.ScalarList`, these scalar lists contain values that change on every training step (e.g., AdamW bias corrections). As a result, the system created a new cache entry on every training iteration, and memory grew without bound as OpStrategy, OpSpec, and OutputSharding objects accumulated indefinitely.
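
To make the failure mode concrete, here is a minimal plain-Python sketch. It is not the actual C++ cache code; `FakeTensor` and `buggy_cache_key` are hypothetical stand-ins that only illustrate how hashing step-varying scalars into a cache key yields one new entry per training step.

```python
class FakeTensor:
    """Stand-in for a tensor: only its metadata should matter for caching."""
    def __init__(self, shape):
        self.shape = shape

def buggy_cache_key(op_name, args):
    # Bug (illustrative): non-tensor items nested inside list arguments, e.g.
    # the scalar list of a foreach op, are hashed by value into the cache key.
    parts = [op_name]
    for arg in args:
        items = arg if isinstance(arg, list) else [arg]
        for item in items:
            if isinstance(item, FakeTensor):
                parts.append(("tensor", tuple(item.shape)))
            else:
                parts.append(repr(item))  # step-varying scalar leaks into the key
    return tuple(parts)

cache = {}
params = [FakeTensor((1024,))]
for step in range(1, 1001):
    bias_correction = 1.0 - 0.999 ** step  # changes every step (AdamW-style)
    key = buggy_cache_key("_foreach_div_.ScalarList", [params, [bias_correction]])
    cache.setdefault(key, "OpStrategy/OpSpec/OutputSharding")  # new entry each step

print(len(cache))  # 1000 distinct entries after 1000 steps
```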

The fix, authored with assistance from Anthropic's Claude AI, makes the cache's key generation consistent: the `static_argnum` filtering already applied to top-level non-tensor arguments is extended to non-tensor items nested inside list arguments. Since foreach operations are registered with `static_argnum=100` by default, their scalar list values are now correctly excluded from the cache key, so the cache stabilizes after an initial warmup phase and the leak is closed. The patch keeps long-running distributed training jobs on PyTorch from gradual performance degradation or out-of-memory crashes.
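
Under the same toy model, the corrected key construction might look like the sketch below; `fixed_cache_key` is a hypothetical illustration of the described filtering, not the real C++ implementation, and the `FakeTensor` stand-in is redefined so the snippet stands alone.

```python
class FakeTensor:
    """Stand-in for a tensor: only its metadata should matter for caching."""
    def __init__(self, shape):
        self.shape = shape

def fixed_cache_key(op_name, args, static_argnum):
    # Non-tensor values are keyed by value only when their argument index is at
    # or beyond static_argnum; the same rule now applies to items nested inside
    # list arguments instead of only to top-level arguments.
    parts = [op_name]
    for i, arg in enumerate(args):
        items = arg if isinstance(arg, list) else [arg]
        for item in items:
            if isinstance(item, FakeTensor):
                parts.append(("tensor", tuple(item.shape)))
            elif i >= static_argnum:
                parts.append(repr(item))
            # else: non-static scalar, excluded from the key
    return tuple(parts)

# With static_argnum=100 (the foreach default noted above), the per-step scalar
# values never reach the key, so the same cache entry is reused every iteration.
params = [FakeTensor((1024,))]
k1 = fixed_cache_key("_foreach_div_.ScalarList", [params, [0.001]], static_argnum=100)
k2 = fixed_cache_key("_foreach_div_.ScalarList", [params, [0.002]], static_argnum=100)
assert k1 == k2
```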

Key Points
  • Fixed a memory leak in PyTorch's DTensor NativeShardingPropagatorCache (PR #178301) that created new cache entries every training step.
  • The bug specifically affected foreach optimizer ops (e.g., `_foreach_div_.ScalarList`) due to hashing of step-varying scalar values.
  • The fix applies `static_argnum` filtering to items nested inside list arguments, letting the cache stabilize after warmup and preventing indefinite memory accumulation.

Why It Matters

This prevents out-of-memory crashes and performance degradation in long-running, distributed AI training jobs, ensuring stable production workloads.