Developer Tools

trunk/89f3759429b96a8693b698f013990240bb4e25b3: [reland] c10d: convert NanCheck to an op + tests (#174736) (#174990)

NanCheck is now a standalone op, and its CUDA kernel assertion logs a clearer message when NaN values are detected during distributed training.

Deep Dive

PyTorch commit 89f3759 converts NanCheck into a standalone operator (op), making it usable outside ProcessGroupNCCL by tools such as torchcomms. The change also adds a CPU implementation and uses the CUDA_KERNEL_ASSERT macro for clearer error logging. Code outside the NCCL process group can now run the same NaN (Not a Number) check on tensors in distributed GPU training, which helps catch bad values before they silently corrupt results and makes failures on multi-GPU systems easier to debug.
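To make the failure mode concrete, here is a minimal pure-Python sketch of why a NaN check before a collective is useful. It does not use PyTorch or the actual NanCheck op. The names `nan_check`, `all_reduce_sum`, and `NanDetectedError` are hypothetical and exist only for this illustration, and the "all-reduce" is a toy element-wise sum over lists standing in for per-rank buffers.

```python
import math
from typing import List, Sequence


class NanDetectedError(RuntimeError):
    """Raised when a buffer contains NaN (hypothetical, not a PyTorch type)."""


def nan_check(buffer: Sequence[float], rank: int) -> None:
    """Fail fast if any element of `buffer` is NaN."""
    for index, value in enumerate(buffer):
        if math.isnan(value):
            raise NanDetectedError(f"NaN found on rank {rank} at index {index}")


def all_reduce_sum(per_rank: List[List[float]], check: bool = False) -> List[float]:
    """Toy stand-in for a sum all-reduce over equally sized per-rank buffers."""
    if check:
        for rank, buffer in enumerate(per_rank):
            nan_check(buffer, rank)
    # Element-wise sum across ranks.
    return [sum(column) for column in zip(*per_rank)]


grads = [[0.1, 0.2], [float("nan"), 0.4]]  # rank 1 holds a NaN

# Without the check, the NaN silently spreads into the reduced result.
unchecked = all_reduce_sum(grads)

# With the check, the offending rank is reported before any reduction happens.
try:
    all_reduce_sum(grads, check=True)
    message = ""
except NanDetectedError as err:
    message = str(err)
```

In this sketch the unchecked reduction returns a NaN in the first position with no error, while the checked call stops early and names the rank and index. The real op performs its check on tensors and, on CUDA, reports failures through a kernel assertion rather than a Python exception.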

Why It Matters

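A NaN that enters a collective such as a sum all-reduce propagates to every participating rank, so catching it at the source helps avoid silent model corruption in large-scale distributed training and shortens the time spent tracing where the bad value came from.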