Research & Papers

ZipServ: Fast and Memory-Efficient LLM Inference with Hardware-Aware Lossless Compression

New hardware-aware lossless compression framework speeds up LLM inference by up to 2.21x at the kernel level while cutting model memory use.

Deep Dive

A research team led by Ruibo Fan has introduced ZipServ, a lossless compression framework co-designed with GPU hardware for efficient Large Language Model (LLM) inference. The system addresses a critical bottleneck: traditional entropy codecs produce variable-length bitstreams in which each element's position depends on everything decoded before it, so decoding is inherently sequential and breaks the SIMT (Single Instruction, Multiple Threads) parallelism that GPUs depend on. ZipServ's key idea is Tensor-Core-Aware Triple Bitmap Encoding (TCA-TBE), a fixed-length compression format that allows constant-time, fully parallel decoding and maps naturally onto the GPU's execution model.
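The write-up does not give the exact TCA-TBE bit layout, but the property that matters, constant-time parallel decoding, is easy to illustrate. Below is a minimal Python sketch of a hypothetical fixed-length, bitmap-based code (a stand-in, not the paper's format): it assumes the exponent fields within a block of trained weights cluster into at most eight values, stores them in a small per-block dictionary, and uses three bit-planes, one "bitmap" per index bit, to select a dictionary entry for each weight, so every weight occupies a fixed number of bits and no decoder thread has to scan a variable-length bitstream.

    # Toy, hypothetical stand-in for fixed-length bitmap encoding -- NOT the
    # paper's actual TCA-TBE layout, which the article does not spell out.
    # Idea: exponents of trained fp16 weights cluster into a few values, so
    # each 32-weight block stores a tiny exponent dictionary plus three
    # bit-planes ("triple bitmap") that pick a dictionary entry per weight.
    import numpy as np

    BLOCK = 32  # weights per block (one bitmap bit per weight and plane)

    def encode_block(w16):
        """Losslessly encode 32 fp16 weights into (exp_dict, bitmaps, payload)."""
        bits = w16.view(np.uint16)
        exp = (bits >> 10) & 0x1F                            # 5-bit exponent field
        payload = ((bits & 0x8000) >> 5) | (bits & 0x03FF)   # sign + 10 mantissa bits
        exp_dict, idx = np.unique(exp, return_inverse=True)
        if len(exp_dict) > 8:
            raise ValueError("toy scheme: at most 8 distinct exponents per block")
        # Three bit-planes of the 3-bit dictionary index; a real GPU layout
        # would pack each plane into a single 32-bit word per block.
        bitmaps = [((idx >> p) & 1).astype(np.uint8) for p in range(3)]
        return exp_dict, bitmaps, payload

    def decode_block(exp_dict, bitmaps, payload):
        """Constant-time, branch-free decode: no sequential bitstream scanning."""
        idx = bitmaps[0] | (bitmaps[1] << 1) | (bitmaps[2] << 2)
        exp = exp_dict[idx]                                   # dictionary lookup
        bits = ((payload & 0x0400) << 5) | (exp << 10) | (payload & 0x03FF)
        return bits.astype(np.uint16).view(np.float16)

    # Bit-exact round trip on a block whose exponents span a few octaves,
    # as is typical for trained LLM weight tensors.
    w = np.linspace(-0.05, 0.05, BLOCK, dtype=np.float16)
    exp_dict, bitmaps, payload = encode_block(w)
    assert np.array_equal(decode_block(exp_dict, bitmaps, payload).view(np.uint16),
                          w.view(np.uint16))

For bf16 weights, whose exponent field is 8 bits wide, replacing the exponent with a 3-bit dictionary index lands in the same ballpark as the roughly 30% reduction reported for ZipServ, though the published TCA-TBE format may differ in its details.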

At the core of ZipServ's performance gain is the ZipGEMM kernel, which fuses decompression with the matrix multiplication (GEMM) itself. Instead of decompressing weights into a temporary buffer, which would add a redundant round trip through memory, ZipGEMM decodes them on the fly directly into Tensor Core registers. This "load-compressed, compute-decompressed" design raises arithmetic intensity by eliminating intermediate data movement, and data movement, not raw compute, is what limits memory-bound LLM inference on modern GPUs.
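The real ZipGEMM kernel is CUDA code that decodes compressed weights straight into Tensor Core register fragments, which is hard to show compactly; the NumPy sketch below only mimics the data-movement pattern, with zlib standing in for the TCA-TBE codec. What it demonstrates is that the full decompressed weight matrix never exists in memory: each tile is decoded immediately before its partial product and discarded afterwards.

    # CPU-side analogy of the "load-compressed, compute-decompressed" pattern --
    # not the actual ZipGEMM CUDA kernel. zlib stands in for the TCA-TBE codec.
    import zlib
    import numpy as np

    TILE_K = 128  # rows of W decoded per step

    def compress_weights(w):
        """Split W (K x N) into K-tiles and losslessly compress each one."""
        tiles = [zlib.compress(w[k:k + TILE_K].tobytes())
                 for k in range(0, w.shape[0], TILE_K)]
        return tiles, w.shape, w.dtype

    def fused_matmul(x, packed):
        """Compute y = x @ W, decoding one weight tile at a time."""
        tiles, (K, N), dtype = packed
        assert x.shape[1] == K
        y = np.zeros((x.shape[0], N), dtype=np.float32)
        for t, blob in enumerate(tiles):
            tile = np.frombuffer(zlib.decompress(blob), dtype=dtype).reshape(-1, N)
            k0 = t * TILE_K
            # Partial product with the just-decoded tile; the tile is then dropped,
            # so the decompressed weight matrix is never materialized in full.
            y += x[:, k0:k0 + tile.shape[0]].astype(np.float32) @ tile.astype(np.float32)
        return y

    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 512)).astype(np.float16)
    w = (0.02 * rng.standard_normal((512, 256))).astype(np.float16)
    ref = x.astype(np.float32) @ w.astype(np.float32)   # decompress-everything baseline
    out = fused_matmul(x, compress_weights(w))
    # Tolerance only covers float32 summation-order differences between the two paths.
    assert np.allclose(out, ref, rtol=1e-4, atol=1e-4)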

The results are substantial. ZipServ reduces model memory footprint by up to 30% while simultaneously accelerating computation: it achieves a 2.21x kernel-level speedup over NVIDIA's highly optimized cuBLAS library, and in end-to-end tests with the vLLM serving engine it delivers an average 1.22x speedup. Lossless compression thus provides both storage savings and performance gains, a combination previously treated as a trade-off. The paper has been accepted for presentation at ASPLOS '26, a top-tier systems conference.

Key Points
  • Uses TCA-TBE fixed-length encoding for up to 30% weight compression with parallel, constant-time GPU decoding.
  • ZipGEMM kernel fuses decompression & computation, streaming weights directly into Tensor Core registers to cut memory traffic.
  • Achieves a 2.21x kernel-level speedup over cuBLAS and an average 1.22x end-to-end speedup when integrated into the vLLM serving engine, breaking the compression-speed trade-off.

Why It Matters

Enables cheaper, faster deployment of large models like GPT-4 and Llama 3 by reducing GPU memory requirements and boosting throughput.