Gemma 4 31B at 256K Full Context on a Single RTX 5090 — TurboQuant KV Cache Benchmark
New 3-bit KV cache compression unlocks the full 256K context window on consumer hardware, hitting 899 tokens/sec prompt processing.
A technical breakthrough demonstrates that Google's Gemma 4 31B model can now run at its full 256,000-token context window on a single consumer-grade NVIDIA RTX 5090 GPU. This was achieved with a novel compression technique called TurboQuant KV cache, specifically the 'turbo3' variant, which applies 3-bit PolarQuant compression combined with a Hadamard rotation to the model's key-value (KV) cache. The result is a ~4.5x compression ratio over standard 16-bit storage, somewhat below the raw 16/3 ≈ 5.3x presumably because quantization metadata such as per-group scales adds overhead. That is enough to fit the long context's memory requirements within the GPU's 32GB of VRAM: total usage comes to 27.7GB, with headroom to spare.
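To see where the ~4.5x figure can come from, here is a minimal back-of-the-envelope sketch. The group size of 32 and the fp16 scale per group are assumptions for illustration, not confirmed TurboQuant internals; the point is that per-group metadata pulls the raw 16/3 ratio down toward the reported number.

```python
# Back-of-the-envelope KV cache compression math.
# ASSUMPTIONS (illustrative, not confirmed TurboQuant internals):
#   - 3 bits per stored KV element ('turbo3')
#   - elements quantized in groups of 32, each group carrying one fp16 scale

BITS_BASELINE = 16   # standard fp16 KV cache
BITS_QUANT = 3       # 3-bit PolarQuant payload
GROUP_SIZE = 32      # assumed quantization group size
SCALE_BITS = 16      # assumed one fp16 scale per group

effective_bits = BITS_QUANT + SCALE_BITS / GROUP_SIZE  # 3.5 bits/element
ratio = BITS_BASELINE / effective_bits                 # ~4.6x

print(f"effective bits/element: {effective_bits:.2f}")
print(f"compression ratio vs fp16: {ratio:.2f}x")  # roughly the reported ~4.5x
```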
Performance benchmarks show the system processing prompts at 899.55 tokens per second at the full 256K context length, with token generation holding steady at 61.5 tokens/second regardless of context size. The setup required custom fixes to the llama.cpp codebase, including a workaround for a Microsoft Visual C++ (MSVC) compiler bug in boolean array reads that broke Gemma 4's hybrid sliding-window attention. The achievement highlights a broader trend: software optimizations and compression algorithms are rapidly closing the gap between data-center-scale AI and what's possible on high-end consumer hardware, potentially democratizing access to frontier-model capabilities for developers and researchers.
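Taken at face value, those two throughput numbers pin down the end-to-end latency of a full-context request; the sketch below just does that arithmetic. The 1,024-token completion length is an arbitrary example, not a benchmarked figure.

```python
# End-to-end latency implied by the reported throughputs.
PROMPT_TOKENS = 256_000    # full context window
PP_SPEED = 899.55          # prompt processing, tokens/sec (reported)
TG_SPEED = 61.5            # token generation, tokens/sec (reported)
COMPLETION_TOKENS = 1_024  # arbitrary example completion length

prefill_s = PROMPT_TOKENS / PP_SPEED     # ~284.6 s (~4.7 min)
decode_s = COMPLETION_TOKENS / TG_SPEED  # ~16.7 s

print(f"prefill: {prefill_s:.1f} s (~{prefill_s / 60:.1f} min)")
print(f"decode {COMPLETION_TOKENS} tokens: {decode_s:.1f} s")
```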
- TurboQuant's 'turbo3' KV cache compression reduces memory footprint by ~4.5x, enabling 256K context on 32GB VRAM.
- The system achieved 899.55 tokens/sec prompt processing at full context, with generation at a constant 61.5 tokens/sec.
- Custom code fixes were required for Windows/MSVC builds to support Gemma 4's hybrid sliding-window attention architecture (see the sketch after this list).
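For readers unfamiliar with the architecture the MSVC workaround touches: hybrid sliding-window attention interleaves layers that attend over the whole context with layers that only see a recent window. The sketch below is a generic illustration of that layer-selection pattern; the interleave ratio and window size are invented values, not Gemma 4's actual configuration. The per-layer boolean array is the kind of structure a miscompiled bool-array read could plausibly corrupt.

```python
# Generic illustration of hybrid sliding-window attention layer selection.
# ASSUMPTIONS: the interleave ratio and window size are invented for this
# example, not Gemma 4's real configuration.
N_LAYERS = 12
SWA_RATIO = 5   # assumed: 5 sliding-window layers per 1 global layer
WINDOW = 1_024  # assumed sliding-window size in tokens

# Per-layer boolean array: True = sliding-window layer, False = global layer.
# A miscompiled read of an array like this would silently apply the wrong
# attention pattern on a per-layer basis.
is_sliding = [(i % (SWA_RATIO + 1)) != SWA_RATIO for i in range(N_LAYERS)]

def visible_range(layer: int, query_pos: int) -> range:
    """Key positions a query token at query_pos may attend to in this layer."""
    if is_sliding[layer]:
        return range(max(0, query_pos - WINDOW + 1), query_pos + 1)
    return range(0, query_pos + 1)  # causal attention over the full context

# At position 5000, a sliding layer sees 1024 keys; a global layer sees 5001.
print(len(visible_range(0, 5000)), len(visible_range(5, 5000)))
```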
Why It Matters
This makes running massive-context, state-of-the-art models feasible on single high-end consumer GPUs, lowering the barrier for advanced AI development.