trunk/603e24483800ba9be0cd9763ceb0ac3533686fb6: [CUDA Graph] Per-capture RNG state (#176752)
Critical fix resolves nested CUDA graph issues that broke torch.cond() and concurrent training.
The PyTorch team has resolved a long-standing technical hurdle in its CUDA Graph execution engine with the merge of pull request #176752. The core issue was that the Random Number Generator (RNG) state, which is critical for reproducibility in training, was tracked as a single state per generator rather than per capture. This architecture broke down during nested or concurrent CUDA graph captures, such as those created by the `torch.cond()` operator for conditional logic: the system could not maintain independent RNG timelines for separate captures sharing the same generator, leading to incorrect results and failed executions.
The fix, a re-submission of earlier work by contributor @galv, fundamentally refactors the state management. It introduces a new `CUDAGeneratorCaptureState` struct to own the GPU RNG tensors and offset for each individual capture. The main `CUDAGeneratorState` now holds a hash map (`capture_states_`) keyed by a `CaptureId_t`, allowing each capture (like different branches of a conditional) to have its own isolated RNG timeline. This explicit per-capture model makes the previous workaround system, `conditional_rng_snapshots_`, obsolete, leading to a cleaner and more robust architecture.
This is a low-level but consequential change for AI researchers and engineers. It directly fixes GitHub issues #172390 and #171263, which blocked the use of CUDA graphs with conditional operations. By ensuring RNG correctness in complex capture scenarios, PyTorch removes a major barrier to using CUDA graphs for performance optimization in dynamic neural network architectures, enabling both faster training and reproducible results where they previously failed.
- Shifts RNG state tracking from per-generator to per-capture, using a unique CaptureId_t key.
- Fixes critical bugs (#172390, #171263) that broke torch.cond() and concurrent CUDA graph captures.
- Enables correct, reproducible training for complex models with conditional logic by isolating RNG timelines.
Why It Matters
Unlocks CUDA graph acceleration for dynamic AI models, ensuring performance and reproducibility in complex training pipelines.