Developer Tools

viable/strict/1775202158: Initial implementation of stateless RNG APIs (#177229)

New API enables deterministic random number generation across GPUs, fixing a major reproducibility pain point.

Deep Dive

PyTorch, the leading open-source machine learning framework, has taken a significant step towards improving reproducibility with the merge of a new stateless Random Number Generator (RNG) API. This initial implementation, introduced in pull request #177229, provides a JAX-like interface for managing randomness. The core functions (`torch.func._random.key()`, `.split()`, and `.fold_in()`) allow developers to explicitly create and manage random keys. Crucially, the underlying implementation pins the `subsequence` parameter to zero, ensuring that random number generation is consistent regardless of the number of CUDA threads, input shapes, or specific GPU devices used. This directly addresses a long-standing challenge where identical PyTorch code could produce different results on different hardware configurations.
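
As a rough sketch of the pattern (the function names come from the PR, but the exact signatures, the integer-seed argument, and the two-way split below are assumptions, not the merged API):

```python
from torch.func import _random as rng  # module path as named in the PR

# Create a root key from an integer seed (assumed signature).
root = rng.key(42)

# Split the root so each consumer gets its own key; reusing a key would
# replay the same random bits, so every use should consume a fresh one.
init_key, dropout_key = rng.split(root, 2)

# Deterministically derive a per-step key from a loop counter
# (assumed fold_in signature: a key plus an integer tag).
for step in range(3):
    step_key = rng.fold_in(dropout_key, step)
```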

For practitioners, this means deterministic and reproducible training runs are now more achievable. The API supports arbitrarily batched keys, enabling advanced use cases like per-sample randomness for techniques in differential privacy or robust training. The PR's code examples demonstrate how to safely consume keys to prevent reuse, a core discipline of stateless RNGs: each key is used exactly once, with `split()` or `fold_in()` producing fresh keys for every new consumer. By adopting this pattern, PyTorch aligns more closely with the functional programming paradigms popularized by JAX, offering researchers a more reliable foundation for experiments that require exact reproducibility, from academic papers to production model debugging.
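
A sketch of what per-sample randomness could look like with batched keys (the signatures and the list-based batching below are assumptions based on the PR description, mirroring the `jax.random.fold_in` idiom):

```python
from torch.func import _random as rng  # module path as named in the PR

batch_size = 8
base = rng.key(1234)

# Derive one independent key per sample by folding the sample index into
# the base key; each result depends only on (seed, index), so per-sample
# streams are reproducible across runs, thread layouts, and devices.
per_sample_keys = [rng.fold_in(base, i) for i in range(batch_size)]
```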

Key Points
  • Introduces a JAX-like stateless RNG API with `key()`, `split()`, and `fold_in()` functions for explicit randomness control.
  • Uses a custom Philox-4x32-10 implementation that guarantees consistent generation across GPU threads and devices, fixing a key reproducibility issue (a pure-Python sketch of the construction follows this list).
  • Enables deterministic training runs and supports batched keys for advanced use cases like per-sample randomness in secure or robust ML.
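
The merged kernel is CUDA, but the counter-based construction itself is standard. A minimal pure-Python sketch of Philox-4x32-10 (constants from the published algorithm, Salmon et al., SC'11, not copied from the PR's code) shows why output cannot depend on thread count, shape, or device: each 128-bit block is a pure function of a (counter, key) pair.

```python
MASK32 = 0xFFFFFFFF
M0, M1 = 0xD2511F53, 0xCD9E8D57  # round multipliers
W0, W1 = 0x9E3779B9, 0xBB67AE85  # key-schedule ("Weyl") increments

def _mulhilo(a, b):
    """High and low 32 bits of the 64-bit product a * b."""
    p = a * b
    return (p >> 32) & MASK32, p & MASK32

def philox_4x32_10(counter, key):
    """One Philox-4x32-10 block: 4 x 32-bit counter words, 2 x 32-bit key words.

    A pure function of (counter, key): the same inputs always produce the
    same 128 output bits, with no hidden state to drift across threads,
    shapes, or devices.
    """
    c0, c1, c2, c3 = counter
    k0, k1 = key
    for _ in range(10):
        hi0, lo0 = _mulhilo(M0, c0)
        hi1, lo1 = _mulhilo(M1, c2)
        c0, c1, c2, c3 = hi1 ^ c1 ^ k0, lo1, hi0 ^ c3 ^ k1, lo0
        k0 = (k0 + W0) & MASK32  # bump the key between rounds
        k1 = (k1 + W1) & MASK32
    return c0, c1, c2, c3

# Identical (counter, key) inputs yield identical outputs, wherever they run.
assert philox_4x32_10((0, 0, 0, 1), (42, 0)) == philox_4x32_10((0, 0, 0, 1), (42, 0))
```

Because generation is driven by an explicit counter rather than mutable hidden state, pinning the `subsequence` field to zero (as the PR does) makes the mapping from key to random stream independent of how work is divided across CUDA threads.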

Why It Matters

This brings reliable, deterministic reproducibility to PyTorch, a critical requirement for scientific research, debugging, and deploying robust AI models.