trunk/89f3759429b96a8693b698f013990240bb4e25b3: [reland] c10d: convert NanCheck to an op + tests (#174736) (#174990)
The CUDA kernel assertion now logs a helpful message when a NaN value is detected during distributed training.
Deep Dive
Meta's PyTorch team converted NanCheck into a standalone operator (op) in commit 89f3759, making it accessible outside ProcessGroupNCCL for tools such as torchcomms. The change adds a new CPU implementation and uses the CUDA_KERNEL_ASSERT macro for clearer error logging. Users can now proactively detect NaN (Not a Number) values in distributed GPU training, preventing silent corruption and improving debugging across multi-GPU systems.
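To make the idea concrete, here is a minimal plain-Python sketch of the kind of pre-communication NaN check the op performs. This is illustrative only: the real NanCheck runs as a C++/CUDA kernel inside PyTorch, and the function name and error message below are hypothetical.

```python
import math

def nan_check(tensor_data, tensor_name="tensor"):
    """Fail loudly if any element is NaN, instead of letting the value
    silently propagate through a collective. (Sketch only; the real op
    asserts inside a CUDA kernel via CUDA_KERNEL_ASSERT.)"""
    for i, value in enumerate(tensor_data):
        if math.isnan(value):
            raise ValueError(f"NaN detected in {tensor_name} at index {i}")
    return True

# A clean gradient buffer passes the check...
nan_check([0.5, -1.25, 3.0], "grad_buffer")

# ...while a corrupted one raises before any communication happens.
try:
    nan_check([0.5, float("nan"), 3.0], "grad_buffer")
except ValueError as err:
    print(err)
```

The design point is the same as in the commit: catching a NaN at the boundary of a collective pinpoints which rank and which buffer went bad, rather than letting the corruption spread to every peer.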
Why It Matters
This prevents silent model corruption during large-scale distributed AI training, saving hours of debugging.