Zipage: Maintain High Request Concurrency for LLM Reasoning through Compressed PagedAttention
New inference engine achieves roughly 95% of full-KV-cache accuracy while more than doubling request throughput.
A team of researchers, primarily from Microsoft, has introduced Zipage, a new inference engine designed to solve a critical bottleneck in serving large language models (LLMs) for complex reasoning tasks. The core problem is the KV (key-value) cache, a memory-intensive component that stores intermediate computations during text generation. For long, multi-step reasoning, this cache can consume massive amounts of GPU memory, severely limiting how many user requests (concurrency) a server can handle simultaneously. Existing solutions often involve evicting parts of the cache, but they can degrade output quality and aren't robust enough for production use.
Zipage's innovation is 'Compressed PagedAttention,' which intelligently combines token-wise KV cache eviction with the efficient memory management of PagedAttention (the technology behind vLLM). It employs a comprehensive scheduling strategy and supports features like prefix caching and asynchronous compression. This allows the system to dynamically manage memory by compressing less critical parts of the KV cache without significantly harming the model's reasoning capabilities. The result is a highly practical engine for industrial applications.
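The article doesn't reproduce Zipage's internals, but the general idea of pairing token-wise KV eviction with paged memory management can be sketched roughly. In the minimal Python sketch below, the page size, the attention-score-based importance measure, and the hard token budget are all hypothetical illustration choices, not Zipage's actual parameters or API; a real engine would compress or offload evicted tokens rather than discard them, and would share a global page table across requests.

```python
# Minimal sketch (assumed design, not Zipage's implementation):
# store one sequence's K/V in fixed-size pages and drop the
# least-attended tokens when a per-request token budget is exceeded.
import numpy as np

PAGE_SIZE = 16   # tokens per physical page (hypothetical)
HEAD_DIM = 64    # per-head key/value width (hypothetical)

class PagedKVCache:
    """Token-wise eviction on top of a paged KV layout."""

    def __init__(self, token_budget):
        self.token_budget = token_budget
        self.keys, self.values, self.scores = [], [], []  # one entry per token

    def append(self, k, v, attn_score):
        # attn_score: e.g. cumulative attention mass this token has received
        self.keys.append(k)
        self.values.append(v)
        self.scores.append(attn_score)
        if len(self.keys) > self.token_budget:
            self._evict()

    def _evict(self):
        # Keep only the highest-scoring tokens, preserving their order;
        # freed page slots could then be reused by other requests.
        keep = np.sort(np.argsort(self.scores)[-self.token_budget:])
        self.keys = [self.keys[i] for i in keep]
        self.values = [self.values[i] for i in keep]
        self.scores = [self.scores[i] for i in keep]

    def pages(self):
        """Yield the retained K/V grouped into fixed-size pages."""
        for start in range(0, len(self.keys), PAGE_SIZE):
            yield (np.stack(self.keys[start:start + PAGE_SIZE]),
                   np.stack(self.values[start:start + PAGE_SIZE]))

# Usage: append 100 tokens under a 48-token budget, then count pages kept.
cache = PagedKVCache(token_budget=48)
rng = np.random.default_rng(0)
for _ in range(100):
    cache.append(rng.normal(size=HEAD_DIM), rng.normal(size=HEAD_DIM),
                 attn_score=float(rng.random()))
print(sum(1 for _ in cache.pages()), "pages retained")
```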
In benchmarks on large-scale mathematical reasoning tasks, Zipage achieved approximately 95% of the accuracy of engines using a full, uncompressed KV cache. Crucially, it did so while delivering more than a 2.1x speedup, meaning it can serve more than twice as many concurrent requests. This directly addresses the trade-off between memory efficiency and output quality, making high-concurrency, reasoning-heavy AI services far more feasible and cost-effective to deploy at scale.
- Uses 'Compressed PagedAttention' to tackle the KV cache memory bottleneck during LLM reasoning.
- Achieves 95% of full KV cache performance while speeding up processing by 2.1x on math tasks.
- Designed as a practical, industrial-grade inference engine with support for prefix caching and asynchronous compression (the async idea is sketched below).
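"Asynchronous compression" suggests that compressing cold parts of the cache happens off the token-generation critical path. The sketch below illustrates that pattern only: the queue-and-worker structure and the toy int8 quantizer are assumptions made for illustration, not Zipage's actual compressor or scheduler, and prefix caching is not shown.

```python
# Minimal sketch of asynchronous KV compression: the decode loop enqueues
# cold pages and keeps generating; a background worker compresses them.
import queue
import threading
import numpy as np

def quantize_page(page):
    """Toy compressor: symmetric int8 quantization of one KV page."""
    scale = float(np.abs(page).max()) / 127.0
    scale = scale or 1.0                      # avoid divide-by-zero on empty pages
    return (page / scale).round().astype(np.int8), scale

compressed = {}            # page_id -> (int8 data, scale)
work = queue.Queue()

def compression_worker():
    while True:
        item = work.get()
        if item is None:                      # shutdown signal
            break
        page_id, page = item
        compressed[page_id] = quantize_page(page)
        work.task_done()

threading.Thread(target=compression_worker, daemon=True).start()

# The "decode loop" here just produces random pages to hand off.
rng = np.random.default_rng(0)
for page_id in range(8):
    work.put((page_id, rng.normal(size=(16, 64)).astype(np.float32)))

work.join()                                   # wait for the background worker
work.put(None)
print(len(compressed), "pages compressed asynchronously")
```

The point of the design is simply that generation latency doesn't pay for compression: the main loop hands pages off and moves on, while memory is reclaimed in the background.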
Why It Matters
Enables AI providers to serve more concurrent reasoning requests cost-effectively, a key hurdle for deploying advanced LLMs.