Developer Tools

PyTorch CUDA kernel update marks helpers static for better optimization

This commit gives file-local helper functions in PyTorch's CUDA kernels internal linkage, which shrinks the set of exported symbols and may let the compiler optimize those helpers more aggressively.

Deep Dive

cyyever tagged a commit that adds `static` to free functions in `aten/src/ATen/native/cuda/` that meet three conditions: their bodies contain no `GPU_LAMBDA` or `__device__` lambdas, they are not declared in any header, and they are not referenced from any other translation unit. The commit skips kernels that use extended `__device__` lambdas because MSVC requires the enclosing function to have external linkage (per CUDA C++ Programming Guide section 5.3.8.4 item 10). The change was authored with Claude.
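To illustrate the kind of change described above, here is a minimal sketch in plain host-side C++. It is not taken from the commit: `ceil_div` and `launch_blocks` are hypothetical names chosen for illustration. The first function is a file-local helper that qualifies for `static`. The second stands in for a function that other translation units call, so it keeps external linkage.

```cpp
#include <cstdint>

// Hypothetical file-local helper. It is not declared in any header and is
// not used outside this translation unit, so it can be marked `static`.
// Internal linkage means the symbol is not exported, and the compiler is
// free to inline the function or drop an out-of-line copy entirely.
static std::int64_t ceil_div(std::int64_t n, std::int64_t block) {
  return (n + block - 1) / block;
}

// Hypothetical entry point that other translation units would call.
// It must keep external linkage, so it is left without `static`.
std::int64_t launch_blocks(std::int64_t numel, std::int64_t threads_per_block) {
  return ceil_div(numel, threads_per_block);
}
```

With this layout, only `launch_blocks` appears as an exported symbol in the object file. A second file that defines its own `ceil_div` would no longer risk a duplicate-symbol clash with this one.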

Key Points
  • The commit adds `static` to free functions in PyTorch's CUDA kernel helpers that do not use `__device__` lambdas.
  • Affected functions are those in `aten/src/ATen/native/cuda/` that are neither declared in headers nor referenced from other translation units.
  • Kernels using extended `__device__` lambdas are skipped because MSVC requires their enclosing functions to have external linkage.

Why It Matters

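Internal linkage keeps these helpers out of the exported symbol table, which lowers the risk of name collisions at link time and gives the compiler more freedom to inline or discard them. The main benefit is cleaner builds of PyTorch's CUDA code. Any runtime gain would be a side effect of better inlining rather than a guaranteed result.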