[CUDA Graph] Per-capture RNG state (#176752)
New per-capture RNG tracking enables nested CUDA graph captures without breaking random number generation.
The PyTorch team has merged a significant update (PR #176752) that changes how CUDA Graphs track random number generation (RNG) state during model training. Previously, RNG state was tracked at the generator level, which created problems when multiple CUDA graph captures ran concurrently or in nested patterns, particularly when torch.cond() was used for conditional execution. Because captures sharing a generator also shared its single tracking state, one capture could interfere with another's RNG timeline, leading to reproducibility issues and potential training failures.
The new implementation introduces a per-capture RNG state model in which each CUDA graph capture gets its own CUDAGeneratorCaptureState object. This structure holds dedicated GPU tensors for RNG state and independently tracks the Philox offset for that specific capture. CUDAGeneratorState now maintains a hash map of capture states keyed by CaptureId_t, allowing multiple captures to coexist without conflict. The refactoring also eliminates the previous conditional_rng_snapshots_ mechanism, a workaround that is no longer necessary under the cleaner per-capture architecture.
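The actual implementation lives in PyTorch's C++ CUDA generator code; the Python sketch below is only a schematic of the data model described above, and every name except CaptureId_t and the general shape is an assumption made for illustration.

```python
from dataclasses import dataclass, field

# Schematic only: the real code is C++ inside PyTorch's CUDA generator,
# and these field/method names are illustrative assumptions.

@dataclass
class CaptureState:
    # Stand-ins for the dedicated GPU tensors holding RNG state.
    seed: int = 0
    # Philox offset tracked independently for this one capture.
    philox_offset: int = 0

@dataclass
class GeneratorState:
    # One entry per active capture, keyed by its capture id
    # (CaptureId_t in the C++ code).
    capture_states: dict[int, CaptureState] = field(default_factory=dict)

    def register_capture(self, capture_id: int) -> CaptureState:
        # Each capture gets its own state, so concurrent or nested
        # captures no longer share (and corrupt) one RNG timeline.
        return self.capture_states.setdefault(capture_id, CaptureState())
```

Keying the map by capture id is what lets nested patterns like torch.cond() work: the inner and outer captures each advance their own offset rather than racing on a single per-generator counter.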
This change directly addresses several GitHub issues (#172390, #171263, #168042) that users encountered when attempting advanced CUDA graph workflows. The fix is particularly important for researchers and engineers using PyTorch's more sophisticated features like torch.cond() within CUDA graph contexts, as it ensures deterministic random number generation even in complex, nested execution patterns. The implementation maintains backward compatibility while providing the foundation for more reliable distributed and concurrent training scenarios.
- Changes RNG state tracking from a per-generator to a per-capture architecture using CUDAGeneratorCaptureState objects
- Fixes critical issues #172390 and #171263 where nested CUDA graph captures corrupted RNG timelines
- Enables reliable use of torch.cond() and other conditional operations within CUDA graph contexts (see the usage sketch below)
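For a concrete picture of the kind of workflow this covers, here is a minimal sketch using only the public CUDA graphs API (it assumes a CUDA device and a PyTorch build containing this change). Both captures draw random numbers from the same default CUDA generator, which is exactly the sharing pattern the per-capture model is meant to make safe:

```python
import torch

assert torch.cuda.is_available()

def add_noise(x):
    return x + torch.randn_like(x)  # RNG op recorded during capture

x1 = torch.zeros(8, device="cuda")
x2 = torch.zeros(8, device="cuda")

# Warm up on a side stream, as the CUDA graphs docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    add_noise(x1)
torch.cuda.current_stream().wait_stream(s)

# Two independent captures sharing the default generator.
g1, g2 = torch.cuda.CUDAGraph(), torch.cuda.CUDAGraph()
with torch.cuda.graph(g1):
    y1 = add_noise(x1)
with torch.cuda.graph(g2):
    y2 = add_noise(x2)

# With per-capture state, each replay advances only its own capture's
# Philox offset, so interleaving replays stays reproducible.
g1.replay()
g2.replay()
```

Under the old generator-level tracking, two captures like g1 and g2 could step on each other's offset bookkeeping; under the new model each capture owns its state, so the interleaving of replays no longer matters.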
Why It Matters
Enables more complex, concurrent AI training workflows in PyTorch without breaking random number generation reproducibility.