PyTorch's register-resident topk kernel speeds up small-K operations by up to 4.45x

PyTorch Releases June 02, 2026

⚡ New CUDA kernel eliminates shared memory overhead, achieving a 2.24x geomean speedup on B200 for K=16.

Deep Dive

PyTorch's latest contribution to its native CUDA DSL stack is a register-resident topk kernel optimized for small K (16 and 32) and small N (up to 2048). Authored by Claude and merged into the main branch, the kernel changes the data path: instead of staging keys through shared or global memory, it keeps all keys in registers and writes only the final K values to global memory. Avoiding intermediate storage removes its latency and memory bandwidth cost, which matters most in the regime where the radix-based kernel cannot run (K<64) and the one-CTA-per-row approach suffers from launch overhead.
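To make the data-path idea concrete, here is a minimal pure-Python sketch of a top-K selection that only ever holds K candidates and emits nothing but the final K results. It is an analogy, not the kernel's algorithm: the article does not describe how the kernel selects its candidates, and the function name `topk_fixed_buffer` and the lower-index-wins tie rule are assumptions made for illustration.

```python
def _better(a, b):
    """True if candidate a = (value, index) should rank ahead of b.

    Assumed tie rule for this sketch: on equal values, the lower index wins.
    """
    return a[0] > b[0] or (a[0] == b[0] and a[1] < b[1])


def topk_fixed_buffer(row, k):
    """Return (values, indices) of the k largest entries of `row`.

    `best` stands in for a fixed-size register array: it never holds more
    than k candidates, and no buffer of size len(row) is ever built.
    """
    best = []  # (value, index) pairs, kept sorted best-first
    for idx, val in enumerate(row):
        cand = (val, idx)
        # Walk backwards to find where the candidate belongs.
        pos = len(best)
        while pos > 0 and _better(cand, best[pos - 1]):
            pos -= 1
        if pos < k:
            best.insert(pos, cand)
            if len(best) > k:
                best.pop()  # drop the candidate that fell out of the top k
    # Only these k results would be written out.
    return [v for v, _ in best], [i for _, i in best]


values, indices = topk_fixed_buffer([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0], k=3)
print(values, indices)
```

On a GPU the equivalent buffer would live in per-thread registers and partial results would be merged across threads; that merge step is omitted here.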

The performance gains are substantial on NVIDIA B200 hardware. For K=16 across N from 64 to 2048, the kernel delivers a geometric mean speedup of 2.24x, with individual benchmarks ranging from 1.16x to 4.45x. For K=32 at N=256, the geometric mean speedup is 1.04x. No regressions were observed across the entire eligible (K, N, M) grid.

Beyond K=32, the kernel falls back to the existing radix kernel (for K=64–1024) or the aten implementation; the dispatcher in cutedsl_impl.py handles this automatically. A key design choice is losslessness: unlike some aggressive optimizations that sacrifice accuracy, this kernel preserves the full original index for each key, ensuring bit-exact output with deterministic tie order, at the cost of slightly lower raw throughput compared to quack's small-K topk.
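The fallback behavior described above can be sketched as a small selection function. This is a hypothetical illustration built only from the ranges quoted in this article; the actual eligibility checks in cutedsl_impl.py are not shown here and may differ (for example, they may also depend on M). The name `choose_topk_backend` and the returned labels are made up for the sketch.

```python
def choose_topk_backend(k, n):
    """Pick a topk backend label from K and N.

    Thresholds are assumptions taken from the figures in the article:
    register kernel for K in {16, 32} with N <= 2048, radix kernel for
    64 <= K <= 1024, and aten for everything else.
    """
    if k in (16, 32) and n <= 2048:
        return "register"
    if 64 <= k <= 1024:
        return "radix"
    return "aten"


for k, n in [(16, 1024), (32, 256), (64, 4096), (2048, 8192)]:
    print(k, n, choose_topk_backend(k, n))
```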

Key Points

Register-resident design: no shared memory or global staging, only final K writes hit global memory
On B200, K=16 achieves 2.24x geomean speedup (up to 4.45x) with zero regressions across N from 64 to 2048
Bit-exact with aten: deterministic tie order by preserving full original index, unlike some lossy optimizations
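One common way to get a deterministic tie order is to compare a composite key that carries the full original index next to the value. The sketch below packs an order-preserving 32-bit encoding of a float32 value together with a 32-bit index into a single integer. Whether the kernel uses this exact encoding is not stated in the article; the helper names, the 64-bit layout, and the lower-index-first tie rule are all assumptions for illustration, and NaN handling is left out.

```python
import struct


def float_to_ordered_u32(x):
    """Map a float32 value to a uint32 whose integer order matches float order."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF  # negative: flip every bit
    return bits | 0x80000000      # non-negative: set the sign bit


def pack_key(value, index):
    """Value in the high 32 bits, complemented index in the low 32 bits.

    Complementing the index makes the lower index compare as larger, so on
    equal values the earlier element ranks first (assumed tie rule).
    """
    return (float_to_ordered_u32(value) << 32) | (0xFFFFFFFF - index)


def unpack_index(key):
    """Recover the full original index from a packed key."""
    return 0xFFFFFFFF - (key & 0xFFFFFFFF)


def topk_indices(row, k):
    """Indices of the k largest values, ties broken by lower index."""
    keys = sorted((pack_key(v, i) for i, v in enumerate(row)), reverse=True)
    return [unpack_index(key) for key in keys[:k]]


print(topk_indices([2.0, 7.0, 7.0, -1.0, 7.0], k=3))
```

Because no index bits are dropped, equal values can always be told apart, so the result does not depend on the order in which candidates are compared. A scheme that truncated the index or the value to save bits would lose that guarantee.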

Why It Matters

Faster small-K topk directly accelerates attention mechanisms and other critical transformer operations in PyTorch.
