Developer Tools

trunk/2005577964bd39f0eed63068edb3788834ec09b5: [xpu][feature] Support fork-safe device_count by pyzes (#178496)

A new patch enables stable multiprocessing for Intel XPU users, fixing a critical fork-safety issue.

Deep Dive

The PyTorch team has merged a significant update to improve Intel XPU (GPU) support in multiprocessing environments. Commit 2005577, authored by developer 'pyzes', introduces fork-safe device counting that prevents crashes when using Python's multiprocessing module with Intel GPUs. This addresses the 'poison fork' problem where child processes could inherit corrupted GPU state from parent processes, a common issue in data loading and parallel training pipelines.

The solution counts devices without initializing the full Level Zero (L0) runtime in the parent process, so a device_count call no longer poisons subsequent forks. The implementation carefully parses the ZE_AFFINITY_MASK environment variable to respect user device selection while maintaining backward compatibility with the existing SYCL-based device counting. When complex composite device masks are detected, the system gracefully falls back to the original c10::xpu::device_count implementation.
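
The mask-handling logic described above can be sketched in pure Python. This is an illustrative reconstruction, not the actual PyTorch code: the helper names and the exact fallback condition for composite masks are assumptions.

```python
import os

def fallback_device_count():
    # Stand-in for the original c10::xpu::device_count path, which
    # enumerates devices through the (fork-unsafe) SYCL runtime.
    return 4

def fork_safe_device_count():
    """Count XPU devices without initializing the runtime, honoring
    ZE_AFFINITY_MASK; fall back when the mask is too complex to
    interpret locally (hypothetical logic)."""
    mask = os.environ.get("ZE_AFFINITY_MASK", "").strip()
    if not mask:
        return fallback_device_count()
    entries = [e.strip() for e in mask.split(",") if e.strip()]
    # Composite entries like "0.1" select sub-devices; the patch
    # reportedly defers to the original path for these.
    if any("." in e for e in entries):
        return fallback_device_count()
    return len(entries)

os.environ["ZE_AFFINITY_MASK"] = "0,1"
print(fork_safe_device_count())   # → 2

os.environ["ZE_AFFINITY_MASK"] = "0.0,0.1"
print(fork_safe_device_count())   # → 4 (falls back)
```

The key design point is that the common cases (no mask, or a simple comma-separated root-device mask) are answered without ever touching the runtime, while anything ambiguous defers to the proven original path.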

This technical fix represents a crucial stability improvement for PyTorch users leveraging Intel's discrete and integrated GPUs for AI workloads. By ensuring consistent device counting behavior across both iGPUs and dGPUs in multiprocessing scenarios, developers can now build more reliable distributed training systems without workarounds for fork-related crashes.

Key Points
  • Fixes 'poison fork' crashes in multiprocessing with Intel XPUs via a fork-safe device count
  • Respects ZE_AFFINITY_MASK for device selection while maintaining SYCL compatibility
  • Enables stable distributed training on Intel GPUs without workarounds

Why It Matters

Enables reliable multiprocessing for PyTorch on Intel GPUs, crucial for distributed training and data loading pipelines.