Developer Tools

trunk/946d524b43c4b982ec7e09bc1ecdeb66e715ed2b: Improving DTensor performance for torch.cat (#174879)

New caching system cuts torch.cat latency from 27.2μs to 16.1μs for distributed tensor operations.

Deep Dive

The PyTorch team at Meta has implemented a significant performance optimization for DTensor operations, specifically targeting the torch.cat function used to concatenate distributed tensors. Previously, when the C++ dispatcher detected that an operator's arguments required pytree (nested Python container) handling, it routed the operation back to Python for processing, creating a per-call bottleneck. The new implementation adds C++-side caching for pytree operations, matching the fast-path behavior already used for non-pytree operations. This closes a performance gap in which distributed tensor concatenation was unnecessarily slow due to repeated Python interpreter overhead.
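
To make the pytree part concrete, here is a rough illustration (not code from the PR) of what "pytree handling" means for torch.cat: its first argument is a list of tensors, which must be flattened into individual leaves around dispatch and rebuilt afterward. The snippet uses torch.utils._pytree, an internal PyTorch module, purely to show the flatten/unflatten step that the change caches on the C++ side.

```python
import torch
from torch.utils._pytree import tree_flatten, tree_unflatten

# torch.cat's arguments form a small pytree: (Tensor[] tensors, int dim)
args = ([torch.randn(2, 3), torch.randn(4, 3)], 0)

# Flatten the nested structure into leaves plus a spec describing the layout.
leaves, spec = tree_flatten(args)
print(len(leaves), spec)  # 3 leaves: two tensors and the dim

# Rebuild the original structure from the leaves and call the op.
rebuilt = tree_unflatten(leaves, spec)
out = torch.cat(rebuilt[0], dim=rebuilt[1])
print(out.shape)  # torch.Size([6, 3])
```

Doing this flatten/unflatten round trip in Python on every call is the overhead the caching removes.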

The technical implementation modifies the create_native_op_schema function to handle pytree flattening directly in C++ rather than delegating to Python, specifically targeting the Tensor[], Tensor?[], and Tensor? argument types that require pytree handling in operations like torch.cat. Benchmark results show latency dropping from 27.2μs to 16.1μs per call (roughly a 40% reduction) once the initial call has warmed the cache. This brings DTensor pytree operations in line with non-pytree operations such as aten::add, which already used the C++ fast path. The change is particularly impactful for distributed training workloads where torch.cat operations occur frequently across sharded tensors.
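
The following is a minimal, hedged micro-benchmark sketch, not the harness used in the PR, showing how one might observe the warm-vs-cold behavior on DTensor torch.cat. It assumes a single-rank gloo process group on CPU and the public torch.distributed.tensor module (older releases expose the same APIs under torch.distributed._tensor); the reported numbers will differ by machine and build.

```python
import os
import time
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import distribute_tensor, Shard

# Single-process setup so the example is runnable without a launcher.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")
dist.init_process_group("gloo", rank=0, world_size=1)

mesh = init_device_mesh("cpu", (1,))
parts = [distribute_tensor(torch.randn(8, 4), mesh, [Shard(0)]) for _ in range(2)]

def cat_call():
    return torch.cat(parts, dim=0)

cat_call()  # warm-up: the first call populates the dispatch/pytree cache

iters = 1000
t0 = time.perf_counter()
for _ in range(iters):
    cat_call()
per_call_us = (time.perf_counter() - t0) / iters * 1e6
print(f"torch.cat on DTensor: {per_call_us:.1f} us per call")

dist.destroy_process_group()
```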

Key Points
  • C++ caching reduces torch.cat latency from 27.2μs to 16.1μs (40% improvement)
  • Eliminates Python routing overhead for pytree operations in distributed tensors
  • Brings DTensor pytree performance in line with existing non-pytree fast paths

Why It Matters

Faster distributed tensor operations accelerate large-scale ML training, reducing compute costs and iteration time.