Developer Tools

viable/strict/1773495265: [DTensor] Add nansum sharding strategy and test (#176819)

New sharding strategy lets the NaN-ignoring sum (nansum) run correctly on tensors partitioned across GPUs, improving distributed training robustness.

Deep Dive

The PyTorch development team has merged an update to its DTensor distributed computing framework, adding a dedicated sharding strategy for the 'nansum' operation. DTensor is PyTorch's system for partitioning large tensors across multiple devices (such as GPUs) to enable large-scale model training. The new strategy, introduced in pull request #176819, ensures that a sum which ignores Not-a-Number (NaN) values produces the correct result even when the input data is split across different hardware nodes, which is crucial for maintaining mathematical correctness in distributed environments.
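A minimal sketch of what this enables in user code, assuming a PyTorch build that includes the change and a two-process launch (for example via torchrun --nproc_per_node=2); the tensor shape, the CPU/gloo backend, and the row-wise sharding here are illustrative choices, not taken from the PR:

  import torch
  import torch.distributed as dist
  from torch.distributed.device_mesh import init_device_mesh
  from torch.distributed.tensor import distribute_tensor, Shard

  # Two-rank mesh; use "cuda"/"nccl" on GPUs, matching the PR's test setup.
  dist.init_process_group("gloo")
  mesh = init_device_mesh("cpu", (2,))

  # Same seed on every rank so each builds the identical global tensor.
  torch.manual_seed(0)
  full = torch.randn(8, 4)
  full[0, 0] = float("nan")

  # Shard the tensor row-wise across the two ranks.
  sharded = distribute_tensor(full, mesh, placements=[Shard(0)])

  # With the nansum sharding strategy in place, this behaves like a
  # single-device nansum: local partial sums are combined across ranks.
  total = sharded.nansum()
  print(total.full_tensor())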

Technically, the update defines how the 'aten.nansum.default' operation propagates sharding specifications (rules such as Partial-to-Partial or Partial-to-Replicate) across a computational graph. Validation on a CUDA device with torch.float32 inputs and a world size of 2 returned 93 correct sharding combinations, 0 incorrect, and 5 missing combinations. The change, approved by core maintainers, fills a gap in DTensor's operator coverage.
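As an illustration of the reduction semantics those rules encode (not the PR's actual code), the behavior can be mimicked with plain tensors: each shard computes a local nansum, giving Partial results, and the Partial-to-Replicate step sums those partials, which matches nansum over the whole tensor:

  import torch

  torch.manual_seed(0)
  full = torch.randn(8, 4)
  full[1, 2] = float("nan")
  full[5, 0] = float("nan")

  # Pretend world size of 2: split the tensor row-wise into two shards.
  shards = torch.chunk(full, chunks=2, dim=0)

  # Each "rank" computes a local nansum, yielding a Partial(sum) value.
  partials = [shard.nansum() for shard in shards]

  # Partial -> Replicate is a sum across ranks (an all-reduce in practice).
  combined = torch.stack(partials).sum()

  torch.testing.assert_close(combined, full.nansum())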

For machine learning engineers, this means greater stability when training models on large, messy datasets. If a data pipeline produces NaN values (from corrupted files, failed sensors, or division by zero), the nansum operation can safely aggregate results without those values poisoning the entire calculation. Previously, running such an operation on a sharded DTensor might have failed or produced incorrect gradients. Now, distributed training jobs can be more resilient, reducing the need for extensive data cleaning and pre-processing steps before scaling up.
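A quick single-device illustration of the NaN-poisoning problem that nansum avoids (plain PyTorch, nothing specific to this PR):

  import torch

  batch = torch.tensor([1.0, 2.0, float("nan"), 4.0])

  print(batch.sum())     # tensor(nan) -- one NaN poisons the plain sum
  print(batch.nansum())  # tensor(7.)  -- NaN entries are ignored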

Key Points
  • Adds a 'nansum' sharding strategy to PyTorch's DTensor for distributed computing (PR #176819).
  • Validation showed 93 correct, 0 incorrect, and 5 missing sharding combinations for the operation.
  • Enables robust distributed training by correctly handling NaN values across partitioned tensors on multiple GPUs.

Why It Matters

Makes large-scale AI training more fault-tolerant with real-world, imperfect data, reducing crashes and manual preprocessing.