PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
15 agents share a single KV cache, cutting 19.8 GB to 0.45 GB.
PolyKV, a new system from researchers Patel and Joshi, tackles the memory explosion problem in multi-agent LLM inference. Instead of allocating a separate KV cache per agent, the current standard, PolyKV writes a single compressed cache and shares it across up to 15 concurrent agents using HuggingFace DynamicCache objects. The asymmetric compression treats keys and values differently: keys are quantized to int8 (q8_0) to preserve softmax stability, while values undergo TurboQuant MSE, a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned for N(0,1). This yields a consistent 2.91x compression ratio across model scales (SmolLM2-1.7B and Llama-3-8B) and context lengths from 600 to 7,194 tokens.
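To make the asymmetric scheme concrete, here is a minimal PyTorch sketch of the two quantization paths. It is not PolyKV's implementation: the q8_0 block size of 32, the per-row rescaling toward unit variance, and the unpacked storage of 3-bit codes are illustrative assumptions. The eight centroids are the classic Lloyd-Max output levels for a standard Gaussian; the paper's exact codebook may differ.

```python
import torch

# Classic Lloyd-Max output levels for an 8-level (3-bit) quantizer of a
# standard Gaussian (Max, 1960). Assumed here; PolyKV's codebook may differ.
LLOYD_MAX_8 = torch.tensor(
    [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]
)

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal Fast Walsh-Hadamard Transform over the last dimension.
    Assumes that dimension is a power of two (true for typical head dims).
    Orthonormal scaling makes the transform its own inverse."""
    n = x.shape[-1]
    out = x.reshape(-1, n).clone()
    h = 1
    while h < n:
        out = out.reshape(-1, n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return out.reshape(x.shape) / n**0.5

def quantize_keys_q8(k: torch.Tensor, block: int = 32):
    """q8_0-style keys: per-block absmax scale plus int8 codes, keeping
    enough precision to stabilize the softmax. Requires the head dim to be
    divisible by `block`; block size 32 is an assumption."""
    kb = k.float().reshape(*k.shape[:-1], -1, block)
    scale = kb.abs().amax(dim=-1, keepdim=True) / 127.0
    codes = torch.round(kb / scale.clamp_min(1e-12)).to(torch.int8)
    return codes, scale

def quantize_values_3bit(v: torch.Tensor):
    """TurboQuant-style values: FWHT rotation (spreads energy so coordinates
    look roughly Gaussian), per-row rescale toward N(0,1), then nearest
    Lloyd-Max centroid. Codes are kept unpacked in uint8 for clarity; a real
    system would bit-pack the 3-bit indices."""
    z = fwht(v.float())
    scale = z.std(dim=-1, keepdim=True).clamp_min(1e-12)
    codes = torch.argmin((z / scale).unsqueeze(-1).sub(LLOYD_MAX_8).abs(), dim=-1)
    return codes.to(torch.uint8), scale

def dequantize_values(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Look up centroids, undo the rescale, then invert the rotation
    # (the orthonormal FWHT is self-inverse).
    return fwht(LLOYD_MAX_8[codes.long()] * scale)
```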
In the most dramatic test, running Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduced KV cache memory from 19.8 GB to just 0.45 GB (a 97.7% savings) while incurring only +0.57% perplexity degradation and maintaining a mean BERTScore F1 of 0.928. Notably, the perplexity penalty does not grow with agent count and actually inverts to -0.26% at longer contexts (1,851 coherent tokens), suggesting the technique may improve as contexts lengthen. This is the first work to combine a single shared, lossy-compressed KV pool with multi-reader concurrent agent access, making it a strong candidate for scaling agent-based LLM systems.
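For contrast, here is what sharing a prefilled context looks like with stock HuggingFace APIs: prefill once into a DynamicCache, then give each agent its own mutable copy via the documented deepcopy pattern. The model id, prompts, and generation settings below are placeholders. PolyKV's contribution is precisely avoiding the per-agent copy, letting many readers hit one compressed pool, which this vanilla pattern cannot do.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Meta-Llama-3-8B"  # one of the two evaluated models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Prefill the shared context once into a single cache object.
shared_context = "..."  # the context all agents read
shared_inputs = tok(shared_context, return_tensors="pt")
with torch.no_grad():
    prefix_cache = model(
        **shared_inputs, past_key_values=DynamicCache()
    ).past_key_values

# Vanilla transformers: each agent needs its own mutable copy, so memory
# still grows linearly with agent count. PolyKV replaces this deepcopy
# with concurrent reads of one compressed pool.
for agent_task in ["Summarize the context.", "List open questions."]:
    inputs = tok(shared_context + agent_task, return_tensors="pt")
    agent_cache = copy.deepcopy(prefix_cache)
    out = model.generate(**inputs, past_key_values=agent_cache, max_new_tokens=64)
    print(tok.decode(out[0, inputs.input_ids.shape[1]:]))
```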
- PolyKV reduces KV cache memory by 97.7% (19.8 GB to 0.45 GB) for 15 concurrent Llama-3-8B agents sharing a 4K-token context (see the arithmetic sketch after this list).
- Asymmetric compression: keys at int8 for softmax stability, values at 3-bit via TurboQuant (FWHT + Lloyd-Max), achieving a consistent 2.91x compression ratio.
- Perplexity degradation is only +0.57% and doesn't scale with agent count; it improves to -0.26% at longer contexts (1,851 tokens).
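As a sanity check, the headline numbers compose cleanly: the overall reduction factors into 15x from sharing one pool instead of fifteen, times roughly 2.9x from compression. A quick back-of-envelope in Python, using only the figures reported above:

```python
baseline_gb = 19.8   # fifteen separate fp16 caches, 4K-token context
polykv_gb = 0.45     # one shared compressed pool
agents = 15

total_reduction = baseline_gb / polykv_gb          # 44.0x overall
compression_factor = total_reduction / agents      # ~2.93x, matching 2.91x
savings_pct = 100 * (1 - polykv_gb / baseline_gb)  # ~97.7%
print(total_reduction, compression_factor, savings_pct)
```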
Why It Matters
Enables memory-efficient multi-agent LLM systems, cutting costs and scaling agent concurrency without major accuracy loss.