[Inductor] Add kernel_num_gb kernel_flop for combo kernels (#180813)
Combo kernels now report accurate performance stats for better profiling.
PyTorch has merged PR #180813, a critical update for its Inductor compiler that adds `kernel_num_gb` and `kernel_flop` metadata to combo kernels. Combo kernels, which fuse multiple sub-kernels for efficiency, previously had this metadata dropped, leaving them without bandwidth or FLOP counts in profiler records and autotune bandwidth logs. This fix computes these values as the sum of each sub-kernel's `estimate_kernel_num_bytes()` and `estimate_flops()`, ensuring accurate performance tracking.
This enhancement is vital for developers optimizing AI models on PyTorch, as it provides granular insights into memory bandwidth and computational intensity of fused operations. With `config.benchmark_kernel` and `config.profile_bandwidth` enabled, users can now profile combo kernels effectively, leading to better autotuning and performance tuning. The PR was approved by eellison and builds on dependencies like #180787.
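To take advantage of this, the two Inductor config flags named above need to be set before compilation. A minimal sketch, assuming a standard PyTorch build where `torch._inductor.config` exposes these flags as module-level attributes (the model and shapes here are placeholders):

```python
# Sketch: enabling Inductor kernel benchmarking and bandwidth profiling
# before compiling a model. The flag names come from the text above; the
# model below is a hypothetical example, not from the PR.
import torch
import torch._inductor.config as inductor_config

inductor_config.benchmark_kernel = True    # benchmark generated kernels
inductor_config.profile_bandwidth = True   # log per-kernel bandwidth stats

def fused_ops(x):
    # Two pointwise ops Inductor may fuse into one kernel
    return torch.relu(x) * torch.sigmoid(x)

compiled = torch.compile(fused_ops)
out = compiled(torch.randn(1024, 1024))
```

With the fix in #180813, the bandwidth and FLOP figures emitted for combo kernels under these flags reflect the summed sub-kernel estimates rather than being absent.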
- Adds `kernel_num_gb` and `kernel_flop` to combo kernel `inductor_meta`.
- Metadata is computed as the sum of each sub-kernel's `estimate_kernel_num_bytes()` and `estimate_flops()`.
- Fixes missing bandwidth and FLOP data in profiler records and autotune logs.
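The aggregation described above can be sketched as follows. This is a simplified illustration, not Inductor's actual class structure: `SubKernel` and `ComboKernel` are hypothetical stand-ins, though the method names `estimate_kernel_num_bytes()` and `estimate_flops()` and the metadata keys `kernel_num_gb` and `kernel_flop` come from the PR:

```python
# Hypothetical sketch of combo-kernel metadata aggregation; SubKernel and
# ComboKernel are simplified stand-ins for Inductor's internal classes.

class SubKernel:
    def __init__(self, num_bytes: float, flops: float):
        self._num_bytes = num_bytes
        self._flops = flops

    def estimate_kernel_num_bytes(self) -> float:
        return self._num_bytes

    def estimate_flops(self) -> float:
        return self._flops


class ComboKernel:
    def __init__(self, sub_kernels: list[SubKernel]):
        self.sub_kernels = sub_kernels

    def build_inductor_meta(self) -> dict:
        # Sum each sub-kernel's byte and FLOP estimates, as the PR does
        total_bytes = sum(k.estimate_kernel_num_bytes() for k in self.sub_kernels)
        # Treat a missing estimate as 0 so one sub-kernel can't poison the sum
        total_flops = sum(k.estimate_flops() or 0 for k in self.sub_kernels)
        return {
            "kernel_num_gb": total_bytes / 1e9,  # bytes -> gigabytes
            "kernel_flop": total_flops,
        }


combo = ComboKernel([SubKernel(2e9, 4e9), SubKernel(1e9, 1e9)])
meta = combo.build_inductor_meta()
print(meta)  # {'kernel_num_gb': 3.0, 'kernel_flop': 5000000000.0}
```

Before the fix, this metadata was simply dropped when sub-kernels were fused, so the profiler had no denominator for bandwidth or FLOP-rate calculations on combo kernels.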
Why It Matters
Accurate combo kernel profiling enables better autotuning and performance optimization for PyTorch models.