PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
15 agents share a single KV cache, cutting 19.8 GB to 0.45 GB.
PolyKV, a new system from researchers Patel and Joshi, tackles the memory explosion problem in multi-agent LLM inference. Instead of allocating a separate KV cache per agent, the current standard, PolyKV writes a single compressed cache and shares it across up to 15 concurrent agents using HuggingFace DynamicCache objects. The asymmetric compression treats keys and values differently: keys are quantized to int8 (q8_0) to preserve softmax stability, while values undergo TurboQuant MSE, a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned for N(0,1). This yields a consistent 2.91x compression ratio across model scales (SmolLM2-1.7B and Llama-3-8B) and context lengths from 600 to 7,194 tokens.
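To make the asymmetric scheme concrete, here is a minimal PyTorch sketch of the two quantization paths. It is not PolyKV's implementation: the q8_0 block size of 32, the per-row rescaling toward unit variance, and the unpacked storage of 3-bit codes are illustrative assumptions. The eight centroids are the classic Lloyd-Max output levels for a standard Gaussian; the paper's exact codebook may differ.

```python
import torch

# Classic Lloyd-Max output levels for an 8-level (3-bit) quantizer of a
# standard Gaussian (Max, 1960). Assumed here; PolyKV's codebook may differ.
LLOYD_MAX_8 = torch.tensor(
    [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]
)

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal Fast Walsh-Hadamard Transform over the last dimension.
    Assumes that dimension is a power of two (true for typical head dims).
    Orthonormal scaling makes the transform its own inverse."""
    n = x.shape[-1]
    out = x.reshape(-1, n).clone()
    h = 1
    while h < n:
        out = out.reshape(-1, n // (2 * h), 2, h)
        a, b = out[..., 0, :], out[..., 1, :]
        out = torch.stack((a + b, a - b), dim=-2)
        h *= 2
    return out.reshape(x.shape) / n**0.5

def quantize_keys_q8(k: torch.Tensor, block: int = 32):
    """q8_0-style keys: per-block absmax scale plus int8 codes, keeping
    enough precision to stabilize the softmax. Requires the head dim to be
    divisible by `block`; block size 32 is an assumption."""
    kb = k.float().reshape(*k.shape[:-1], -1, block)
    scale = kb.abs().amax(dim=-1, keepdim=True) / 127.0
    codes = torch.round(kb / scale.clamp_min(1e-12)).to(torch.int8)
    return codes, scale

def quantize_values_3bit(v: torch.Tensor):
    """TurboQuant-style values: FWHT rotation (spreads energy so coordinates
    look roughly Gaussian), per-row rescale toward N(0,1), then nearest
    Lloyd-Max centroid. Codes are kept unpacked in uint8 for clarity; a real
    system would bit-pack the 3-bit indices."""
    z = fwht(v.float())
    scale = z.std(dim=-1, keepdim=True).clamp_min(1e-12)
    codes = torch.argmin((z / scale).unsqueeze(-1).sub(LLOYD_MAX_8).abs(), dim=-1)
    return codes.to(torch.uint8), scale

def dequantize_values(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    # Look up centroids, undo the rescale, then invert the rotation
    # (the orthonormal FWHT is self-inverse).
    return fwht(LLOYD_MAX_8[codes.long()] * scale)
```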
In the most dramatic test, running Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduced KV cache memory from 19.8 GB to just 0.45 GB (a 97.7% savings) while incurring only +0.57% perplexity degradation and maintaining a mean BERTScore F1 of 0.928. Notably, the perplexity penalty does not grow with agent count and actually inverts to -0.26% at longer contexts (1,851 coherent tokens), suggesting the technique may improve as contexts lengthen. This is the first work to combine a single shared, lossy-compressed KV pool with multi-reader concurrent agent access, making it a strong candidate for scaling agent-based LLM systems.
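For contrast, here is what sharing a prefilled context looks like with stock HuggingFace APIs: prefill once into a DynamicCache, then give each agent its own mutable copy via the documented deepcopy pattern. The model id, prompts, and generation settings below are placeholders. PolyKV's contribution is precisely avoiding the per-agent copy, letting many readers hit one compressed pool, which this vanilla pattern cannot do.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache

model_id = "meta-llama/Meta-Llama-3-8B"  # one of the two evaluated models
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

# Prefill the shared context once into a single cache object.
shared_context = "..."  # the context all agents read
shared_inputs = tok(shared_context, return_tensors="pt")
with torch.no_grad():
    prefix_cache = model(
        **shared_inputs, past_key_values=DynamicCache()
    ).past_key_values

# Vanilla transformers: each agent needs its own mutable copy, so memory
# still grows linearly with agent count. PolyKV replaces this deepcopy
# with concurrent reads of one compressed pool.
for agent_task in ["Summarize the context.", "List open questions."]:
    inputs = tok(shared_context + agent_task, return_tensors="pt")
    agent_cache = copy.deepcopy(prefix_cache)
    out = model.generate(**inputs, past_key_values=agent_cache, max_new_tokens=64)
    print(tok.decode(out[0, inputs.input_ids.shape[1]:]))
```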
- PolyKV reduces KV cache memory by 97.7% (19.8 GB to 0.45 GB) for 15 concurrent Llama-3-8B agents sharing a 4K-token context (see the arithmetic sketch after this list).
- Asymmetric compression: keys at int8 for softmax stability, values at 3-bit via TurboQuant (FWHT + Lloyd-Max), achieving a consistent 2.91x compression ratio.
- Perplexity degradation is only +0.57% and doesn't scale with agent count; it improves to -0.26% at longer contexts (1,851 tokens).
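As a sanity check, the headline numbers compose cleanly: the overall reduction factors into 15x from sharing one pool instead of fifteen, times roughly 2.9x from compression. A quick back-of-envelope in Python, using only the figures reported above:

```python
baseline_gb = 19.8   # fifteen separate fp16 caches, 4K-token context
polykv_gb = 0.45     # one shared compressed pool
agents = 15

total_reduction = baseline_gb / polykv_gb          # 44.0x overall
compression_factor = total_reduction / agents      # ~2.93x, matching 2.91x
savings_pct = 100 * (1 - polykv_gb / baseline_gb)  # ~97.7%
print(total_reduction, compression_factor, savings_pct)
```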
Why It Matters
Enables memory-efficient multi-agent LLM systems, cutting costs and scaling agent concurrency without major accuracy loss.