PyTorch TorchInductor optimizes persistent reductions with smarter eviction policy
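TorchInductor now applies its delayed eviction-policy decision to persistent reductions as well as looped ones, choosing the cache eviction hint for each load from whether it is coalesced, reused, or broadcasted.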
Deep Dive
The delayed eviction-policy decision is applied to persistent reductions as well as looped reductions, so coalesced last-use loads get evict_first while reused, broadcasted, and non-coalesced loads keep evict_last. This revives the codegen portion of stale PR #119622 and fixes #119523.
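The rule in the paragraph above can be sketched as a small decision function. This is an illustrative sketch only, not TorchInductor's code: `LoadInfo`, its fields, and `choose_eviction_policy` are hypothetical names introduced here, and the three flags are assumed to be known by the time the decision is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoadInfo:
    """Hypothetical description of one load inside a reduction kernel."""
    name: str
    coalesced: bool    # adjacent threads read adjacent addresses
    broadcasted: bool  # the same value is read repeatedly along an axis
    used_again: bool   # the buffer is loaded again later in the kernel


def choose_eviction_policy(load: LoadInfo) -> str:
    """Pick a cache hint: evict_first only for coalesced last-use loads."""
    if load.coalesced and not load.broadcasted and not load.used_again:
        return "evict_first"
    # Reused, broadcasted, and non-coalesced loads keep evict_last.
    return "evict_last"


# Example loads (names and flags are made up for illustration).
loads = [
    LoadInfo("x_last_use", coalesced=True, broadcasted=False, used_again=False),
    LoadInfo("x_reused", coalesced=True, broadcasted=False, used_again=True),
    LoadInfo("scale_bcast", coalesced=True, broadcasted=True, used_again=False),
    LoadInfo("idx_gather", coalesced=False, broadcasted=False, used_again=False),
]
policies = {load.name: choose_eviction_policy(load) for load in loads}
```

The returned strings match the values that Triton's `tl.load` accepts for its `eviction_policy` argument, which is where such a hint would appear in a generated kernel.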
Key Points
- Extends the existing delayed eviction policy from looped reductions to persistent reductions in TorchInductor.
- Coalesced last-use loads are marked evict_first, while reused, broadcasted, or non-coalesced loads retain evict_last.
- Revives the codegen portion of stale PR #119622 and fixes issue #119523; the change affects generated GPU kernel code.
Why It Matters
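Eviction hints tell the GPU cache which loaded data can be dropped first. Marking coalesced last-use loads as evict_first leaves more cache for data that will be read again, which may speed up training and inference for models that rely heavily on reductions.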