PyTorch CUDA kernel update marks helpers static for better optimization
By giving file-local helper functions internal linkage, this commit reduces the number of externally visible symbols in PyTorch's CUDA kernels and may give the compiler more room to optimize them.
cyyever tagged a commit that adds `static` to free functions in `aten/src/ATen/native/cuda/` that meet three conditions: their bodies do not contain `GPU_LAMBDA` or `__device__` lambdas, they are not declared in any header, and they are not referenced from any other translation unit. The commit skips kernels that use extended `__device__` lambdas because MSVC requires the enclosing function to have external linkage (per CUDA C++ Programming Guide section 5.3.8.4 item 10). The change was authored with Claude.
- The commit adds `static` to free functions in PyTorch's CUDA kernel helpers that do not use `__device__` lambdas.
- Functions affected are those in `aten/src/ATen/native/cuda/` not declared in headers or referenced from other translation units.
- Kernels using extended `__device__` lambdas are skipped due to MSVC requirements for external linkage.
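The rule in the points above can be illustrated with a minimal sketch in plain C++, so that it compiles without `nvcc`. The names `ceil_div` and `blocks_needed` are hypothetical and do not come from the PyTorch source; the sketch only shows what marking a file-local free function `static` looks like, not the commit's actual changes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical file-local helper: it is used only in this translation unit
// and is declared in no header, so `static` can give it internal linkage.
static std::size_t ceil_div(std::size_t n, std::size_t block) {
  return (n + block - 1) / block;
}

// Hypothetical entry point that other files might call. It is left without
// `static`, so it keeps external linkage and stays visible to the linker.
std::size_t blocks_needed(const std::vector<float>& data, std::size_t block) {
  return ceil_div(data.size(), block);
}
```

Here `ceil_div` is not visible to other translation units, while `blocks_needed` is. The sketch does not model the CUDA-specific exception: in the commit as described, a function whose body contains an extended `__device__` lambda is left with external linkage.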
Why It Matters
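The change keeps file-local helpers out of the set of externally visible symbols, which lowers the chance of name clashes at link time and may let the compiler inline or discard those helpers more freely. Any gains in compilation efficiency or runtime performance for GPU-accelerated deep learning users are potential rather than measured here.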