Developer Tools

trunk/00ab8be510ed8e0b153d72126b799ea12996d69c: [DTensor] set device index only for existing devices (#174845)

A subtle PyTorch bug could have crashed your multi-GPU training runs...

Deep Dive

PyTorch merged a fix (PR #174845) to its DTensor system that prevents crashes during process-group initialization. The bug occurred in `DTensorContinuousTestBase`, where `set_device_index(rank)` was called without first verifying that enough physical GPUs existed on the host. When the number of ranks exceeded the number of visible devices, distributed setup could fail unexpectedly. The fix ensures the device index is only set for devices that actually exist, improving stability for multi-GPU and distributed machine learning workloads.
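The guard described above boils down to comparing the rank against the number of visible devices before selecting one. Below is a minimal, illustrative sketch of that pattern; the function name and signature are hypothetical, not PyTorch's actual API or the PR's diff:

```python
def set_device_index_if_present(rank: int, device_count: int):
    """Select a device for this rank only when it physically exists.

    Returns the chosen device index, or None when the rank exceeds
    the number of visible devices (e.g. more ranks than GPUs).
    Hypothetical helper for illustration; not from the PyTorch source.
    """
    if 0 <= rank < device_count:
        return rank  # safe: device `rank` exists on this host
    return None  # skip: avoids binding to a nonexistent device

# With 2 visible GPUs, rank 1 is valid but rank 3 is not.
print(set_device_index_if_present(1, 2))  # 1
print(set_device_index_if_present(3, 2))  # None
```

The key design point is that the caller treats `None` as "leave the device unset" rather than raising, so hosts with fewer GPUs than ranks degrade gracefully instead of crashing during setup.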

Why It Matters

This prevents unexpected crashes during distributed setup and the wasted compute that follows when large-scale training jobs die partway through initialization.