[P] GPU-friendly lossless 12-bit BF16 format with 0.03% escape rate and single-integer-ADD decode, works on AMD & NVIDIA
The new format shrinks BF16 weights by 1.33x, enabling up to 2.93x faster multi-user inference on standard GPUs.
Independent researcher cenconq25 has open-sourced Turbo-Lossless, a compression format designed to make large language model (LLM) inference faster and more memory-efficient. The core innovation is a lossless method that stores standard BF16 weights in just 12 bits (slightly above the roughly 11 bits that tighter entropy-coded schemes reach, traded away for alignment and decode speed) by replacing the 8-bit exponent with a compact 4-bit group code. Crucially, for 99.97% of weights, decoding requires only a single integer ADD, and storage is byte-aligned with zero HBM read amplification. This makes the format "GPU-friendly": a fused decode+GEMM kernel can process compressed weights directly, eliminating a separate decompression stage.
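The post does not spell out the exact bit layout, but the mechanism as described (a 4-bit code that is the exponent's offset from a per-tensor base, so decode is one integer ADD, with rare escapes for out-of-window exponents) can be sketched in a few lines. Below is a minimal NumPy sketch under those assumptions; the sign/code/mantissa packing, the `base_exp` parameter, and the escape-list handling are illustrative, not the actual Turbo-Lossless format:

```python
import numpy as np

def encode12(bf16_bits: np.ndarray, base_exp: int):
    """Split BF16 bit patterns into 12-bit codes plus an escape list (assumed layout)."""
    sign = (bf16_bits >> 15) & 0x1
    exp  = (bf16_bits >> 7) & 0xFF
    mant = bf16_bits & 0x7F
    off  = exp.astype(np.int32) - base_exp
    ok   = (off >= 0) & (off < 16)                 # exponent fits the 4-bit window
    code = (sign << 11) | ((off.astype(np.uint16) & 0xF) << 7) | mant
    escapes = np.flatnonzero(~ok)                  # stored raw; ~0.03% in practice
    return np.where(ok, code, 0).astype(np.uint16), escapes, bf16_bits[escapes]

def decode12(codes: np.ndarray, base_exp: int, escapes: np.ndarray, raw: np.ndarray):
    """Rebuild exact BF16 bits; exponent = base_exp + 4-bit offset, one integer ADD."""
    sign = (codes >> 11) & 0x1
    exp  = ((codes >> 7) & 0xF) + base_exp         # the single integer ADD
    out  = (sign << 15) | (exp << 7) | (codes & 0x7F)
    out[escapes] = raw                             # patch the rare out-of-window weights
    return out

# Round-trip check on synthetic BF16 bit patterns with clustered exponents.
rng  = np.random.default_rng(0)
exp  = rng.integers(120, 136, 4096, dtype=np.uint16)   # 16-wide exponent window
exp[:3] = 200                                          # force a few escapes
bits = (rng.integers(0, 2, 4096, dtype=np.uint16) << 15) | (exp << 7) \
       | rng.integers(0, 128, 4096, dtype=np.uint16)
codes, esc, raw = encode12(bits, base_exp=120)
assert np.array_equal(decode12(codes, 120, esc, raw), bits)  # bit-exact, i.e. lossless
```

Two 12-bit codes fit in exactly three bytes, which is presumably how the real format stays byte-aligned; the sketch leaves codes unpacked for clarity.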
Benchmark results are impressive, showing significant speedups over the popular vLLM framework. In multi-user scenarios (batch size 256), Turbo-Lossless delivered 2.93x higher total tokens per second on Mistral 7B. It also proved stable across diverse model architectures, from the massive Llama 3.1 405B (0.034% escape rate) to diffusion models such as SDXL. The format is hardware-agnostic, running on both NVIDIA and AMD GPUs, and its V3 kernel uses tensor-core access patterns inspired by recent academic work such as ZipGEMM. While currently tested only on BF16 safetensors, the prototype points to a future where model size and inference speed are no longer a direct trade-off.
- Achieves true 12-bit per weight storage, providing a 1.33x size reduction over BF16 with zero precision loss.
- Enables fused decode+matmul; 99.97% of weights decode with one integer ADD, removing decompression overhead (see the sketch after this list).
- Demonstrated 2.93x higher multi-user throughput vs. vLLM and runs on both NVIDIA and AMD consumer GPUs.
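Because decode is a couple of integer ops, it can live inside the GEMM inner loop rather than run as a separate pass. Here is a hedged NumPy sketch of that fusion idea: each tile of 12-bit codes is decoded and multiplied immediately, so the full BF16 matrix is never materialized. A real GPU kernel would do this per-thread in registers; the `fused_matvec` name, the tile size, and the escape-free decode are assumptions, not the project's V3 kernel:

```python
import numpy as np

def fused_matvec(codes: np.ndarray, base_exp: int, x: np.ndarray, tile: int = 256):
    """y = W @ x where W is stored as 12-bit codes (escape path omitted)."""
    rows, cols = codes.shape
    y = np.zeros(rows, dtype=np.float32)
    for j in range(0, cols, tile):
        c = codes[:, j:j + tile]
        sign = (c >> 11) & 0x1
        exp  = ((c >> 7) & 0xF) + base_exp        # the single integer ADD
        bits = (sign << 15) | (exp << 7) | (c & 0x7F)
        # Reinterpret BF16 bits as float32 by shifting them into the high half.
        w = (bits.astype(np.uint32) << 16).view(np.float32)
        y += w @ x[j:j + tile]                    # consume the tile, then discard it
    return y

# Quick check against multiplying with fully decoded weights.
rng   = np.random.default_rng(1)
codes = rng.integers(0, 1 << 12, size=(64, 512), dtype=np.uint16)
x     = rng.standard_normal(512).astype(np.float32)
full  = (((codes >> 11) & 0x1) << 15) | ((((codes >> 7) & 0xF) + 120) << 7) | (codes & 0x7F)
ref   = (full.astype(np.uint32) << 16).view(np.float32) @ x
assert np.allclose(fused_matvec(codes, 120, x), ref, rtol=1e-4, atol=0.1)
```

The point of fusing is that no decompressed copy of the weights ever hits memory, so the 12-bit reads translate directly into HBM bandwidth savings.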
Why It Matters
This could drastically reduce the cost and latency of serving LLMs, making advanced AI more accessible on standard hardware.