[Inductor] Add kernel_num_gb kernel_flop for combo kernels (#180813)
Combo kernels now report accurate performance stats for better profiling.
PyTorch has merged PR #180813, a critical update for its Inductor compiler that adds `kernel_num_gb` and `kernel_flop` metadata to combo kernels. Combo kernels, which fuse multiple sub-kernels for efficiency, previously had this metadata dropped, leaving them without bandwidth or FLOP counts in profiler records and autotune bandwidth logs. This fix computes these values as the sum of each sub-kernel's `estimate_kernel_num_bytes()` and `estimate_flops()`, ensuring accurate performance tracking.
This enhancement is vital for developers optimizing AI models on PyTorch, as it provides granular insights into memory bandwidth and computational intensity of fused operations. With `config.benchmark_kernel` and `config.profile_bandwidth` enabled, users can now profile combo kernels effectively, leading to better autotuning and performance tuning. The PR was approved by eellison and builds on dependencies like #180787.
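To take advantage of this, the two Inductor config flags named above need to be set before compilation. A minimal sketch, assuming a standard PyTorch build where `torch._inductor.config` exposes these flags as module-level attributes (the model and shapes here are placeholders):

```python
# Sketch: enabling Inductor kernel benchmarking and bandwidth profiling
# before compiling a model. The flag names come from the text above; the
# model below is a hypothetical example, not from the PR.
import torch
import torch._inductor.config as inductor_config

inductor_config.benchmark_kernel = True    # benchmark generated kernels
inductor_config.profile_bandwidth = True   # log per-kernel bandwidth stats

def fused_ops(x):
    # Two pointwise ops Inductor may fuse into one kernel
    return torch.relu(x) * torch.sigmoid(x)

compiled = torch.compile(fused_ops)
out = compiled(torch.randn(1024, 1024))
```

With the fix in #180813, the bandwidth and FLOP figures emitted for combo kernels under these flags reflect the summed sub-kernel estimates rather than being absent.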
- Adds `kernel_num_gb` and `kernel_flop` to combo kernel `inductor_meta`.
- Metadata is computed as the sum of each sub-kernel's `estimate_kernel_num_bytes()` and `estimate_flops()`.
- Fixes missing bandwidth and FLOP data in profiler records and autotune logs.
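The aggregation described above can be sketched as follows. This is a simplified illustration, not Inductor's actual class structure: `SubKernel` and `ComboKernel` are hypothetical stand-ins, though the method names `estimate_kernel_num_bytes()` and `estimate_flops()` and the metadata keys `kernel_num_gb` and `kernel_flop` come from the PR:

```python
# Hypothetical sketch of combo-kernel metadata aggregation; SubKernel and
# ComboKernel are simplified stand-ins for Inductor's internal classes.

class SubKernel:
    def __init__(self, num_bytes: float, flops: float):
        self._num_bytes = num_bytes
        self._flops = flops

    def estimate_kernel_num_bytes(self) -> float:
        return self._num_bytes

    def estimate_flops(self) -> float:
        return self._flops


class ComboKernel:
    def __init__(self, sub_kernels: list[SubKernel]):
        self.sub_kernels = sub_kernels

    def build_inductor_meta(self) -> dict:
        # Sum each sub-kernel's byte and FLOP estimates, as the PR does
        total_bytes = sum(k.estimate_kernel_num_bytes() for k in self.sub_kernels)
        # Treat a missing estimate as 0 so one sub-kernel can't poison the sum
        total_flops = sum(k.estimate_flops() or 0 for k in self.sub_kernels)
        return {
            "kernel_num_gb": total_bytes / 1e9,  # bytes -> gigabytes
            "kernel_flop": total_flops,
        }


combo = ComboKernel([SubKernel(2e9, 4e9), SubKernel(1e9, 1e9)])
meta = combo.build_inductor_meta()
print(meta)  # {'kernel_num_gb': 3.0, 'kernel_flop': 5000000000.0}
```

Before the fix, this metadata was simply dropped when sub-kernels were fused, so the profiler had no denominator for bandwidth or FLOP-rate calculations on combo kernels.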
Why It Matters
Accurate combo kernel profiling enables better autotuning and performance optimization for PyTorch models.