Research & Papers

torch-nvenc-compress: GPU NVENC silicon as a PCIe bandwidth multiplier — PCA + pure-ctypes Video Codec SDK wrapper. Parallel-path overlap measured at 67% of theoretical max on a real GEMM + encode workload. [P]

A Python library that compresses activations and KV cache on the fly using GPU video encoders, delivering 6.1x lossless compression on diffusion-model activations.

Deep Dive

Nvidia removed NVLink from the RTX 4090 and 5090, forcing multi-GPU setups to rely on PCIe peer-to-peer at roughly 30 GB/s, a severe bottleneck when splitting a large model such as a 70B LLM across cards. Developer shootthesound (GitHub) built torch-nvenc-compress to attack the problem by repurposing the GPU's otherwise-idle video encode/decode silicon (NVENC/NVDEC) as a tensor codec. As preprocessing, the library applies PCA and rank truncation, exploiting the heavy-tailed channel covariance so that most of the signal energy concentrates in a few planes the H.264 codec compresses efficiently. A new pure-ctypes wrapper (DirectBackend) talks to the Video Codec SDK directly, giving zero-copy input from torch CUDA tensors and multi-engine pipelining across all three NVENC engines on an RTX 5090, while avoiding the latency of FFmpeg subprocess calls.
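
To make the preprocessing concrete, here is a minimal sketch of the PCA + rank-truncation idea for a single [C, H, W] activation block. Everything here is an illustrative assumption, not the library's API: the function names (pca_rank_truncate, pca_restore), the rank of 64, and the per-plane 8-bit quantization are all hypothetical, and this particular sketch is lossy, whereas the repo reports a lossless pipeline. It only shows why a heavy-tailed channel covariance helps: most energy lands in a few planes that are cheap to encode.

```python
import torch

def pca_rank_truncate(acts: torch.Tensor, rank: int = 64):
    """Project a [C, H, W] activation block onto its top-`rank`
    principal channel directions and quantize each coefficient
    plane to 8-bit frames for a video encoder. Hypothetical sketch."""
    C, H, W = acts.shape
    flat = acts.reshape(C, -1).float()          # [C, H*W]
    mean = flat.mean(dim=1, keepdim=True)
    centered = flat - mean
    # Heavy-tailed channel covariance: a few eigenvalues dominate,
    # so a low-rank basis captures most of the signal.
    U, S, _ = torch.linalg.svd(centered, full_matrices=False)
    basis = U[:, :rank]                         # [C, rank]
    coeffs = basis.T @ centered                 # [rank, H*W]
    # Per-plane 8-bit quantization so NVENC can treat each
    # coefficient plane as a grayscale frame.
    lo = coeffs.amin(dim=1, keepdim=True)
    hi = coeffs.amax(dim=1, keepdim=True)
    scale = (hi - lo).clamp_min(1e-8)
    planes = ((coeffs - lo) / scale * 255).round().to(torch.uint8)
    return planes.reshape(rank, H, W), (basis, mean, lo, scale)

def pca_restore(planes: torch.Tensor, side) -> torch.Tensor:
    """Invert the projection after decode; side info is tiny
    relative to the encoded planes."""
    basis, mean, lo, scale = side
    rank, H, W = planes.shape
    coeffs = planes.reshape(rank, -1).float() / 255 * scale + lo
    flat = basis @ coeffs + mean
    return flat.reshape(basis.shape[0], H, W)
```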

Results are striking: lossless compression ratios of 6.1× on diffusion model activations (FLUX.2 Klein 9B) and 2.7× on LLM KV cache (Mistral 7B v0.3), validated across thousands of captures. Encoding takes 0.243 ms per 256×256 YUV444 frame with the DirectBackend, and 0.180 ms when pipelining across all three NVENC engines, a 7.9× improvement over the FFmpeg subprocess baseline. Parallel-path overlap was empirically measured at 67% of theoretical max when interleaving GEMM compute with encode streams. The repo also documents null findings to save others from dead ends. The approach effectively turns NVENC into a bandwidth multiplier, making multi-GPU inference on consumer hardware practical again.
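
The 67% overlap figure invites the question of how such a number is measured. Below is a hedged sketch using CUDA streams in PyTorch, assuming one plausible definition of overlap (time actually hidden divided by the maximum hideable time); the repo's exact metric and harness may differ. The encode side is proxied by a pinned-memory device-to-host copy, since real NVENC submission goes through the driver API rather than PyTorch.

```python
import time
import torch

def time_ms(fn, iters=50):
    """Average wall-clock per call, synchronizing all CUDA streams."""
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters * 1e3

a = torch.randn(4096, 4096, device="cuda")
b = torch.randn(4096, 4096, device="cuda")
frames = torch.randint(0, 256, (64, 256, 256), dtype=torch.uint8, device="cuda")
host = torch.empty(frames.shape, dtype=torch.uint8).pin_memory()

s_gemm, s_enc = torch.cuda.Stream(), torch.cuda.Stream()

def gemm():
    # Compute path on its own stream.
    with torch.cuda.stream(s_gemm):
        a @ b

def encode_proxy():
    # Stand-in for NVENC submission: async D2H copy to pinned memory.
    with torch.cuda.stream(s_enc):
        host.copy_(frames, non_blocking=True)

t_g = time_ms(gemm)
t_e = time_ms(encode_proxy)
t_both = time_ms(lambda: (gemm(), encode_proxy()))

# Time actually hidden over the maximum hideable time.
overlap = (t_g + t_e - t_both) / min(t_g, t_e)
print(f"gemm {t_g:.3f} ms  encode {t_e:.3f} ms  "
      f"together {t_both:.3f} ms  overlap {overlap:.0%}")
```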

Key Points
  • 6.1x lossless compression on diffusion model activations (FLUX.2 Klein 9B mid-block) and 2.7x on LLM KV cache (Mistral 7B v0.3)
  • 0.243 ms per-frame encode (256×256 YUV444) with the DirectBackend, 0.180 ms across all three NVENC engines
  • Pure-ctypes DirectBackend wrapper eliminates FFmpeg subprocess overhead for a 7.9x speedup; see the ctypes sketch below
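
The pure-ctypes point deserves a concrete illustration. The sketch below is not the library's DirectBackend code; it only shows the general technique of loading the NVENC driver library and calling into it directly, with no FFmpeg subprocess in the path. NvEncodeAPIGetMaxSupportedVersion is a real Video Codec SDK entry point; a full encoder would continue by filling an NV_ENCODE_API_FUNCTION_LIST via NvEncodeAPICreateInstance and driving sessions through those function pointers, passing CUDA device pointers for zero-copy input.

```python
import ctypes

# Load the NVENC driver library directly -- no FFmpeg subprocess.
# Library name assumes Linux; on Windows it is nvEncodeAPI64.dll.
lib = ctypes.CDLL("libnvidia-encode.so.1")

# NVENCSTATUS NvEncodeAPIGetMaxSupportedVersion(uint32_t *version)
lib.NvEncodeAPIGetMaxSupportedVersion.restype = ctypes.c_int
lib.NvEncodeAPIGetMaxSupportedVersion.argtypes = [
    ctypes.POINTER(ctypes.c_uint32)
]

ver = ctypes.c_uint32(0)
status = lib.NvEncodeAPIGetMaxSupportedVersion(ctypes.byref(ver))
assert status == 0, f"NVENC error code {status}"

# The returned value packs major/minor as (major << 4) | minor.
print(f"NVENC API {ver.value >> 4}.{ver.value & 0xF}")
```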

Why It Matters

Enables practical multi-GPU AI workloads without NVLink, multiplying effective PCIe bandwidth by up to ~6x on compressible activations and ~2.7x on KV cache.