TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
A new system supports 2.7x more concurrent agents than vLLM by eliminating redundant KV cache memory.
A research team from Peking University and other institutions has introduced TokenDance, a novel system designed to overcome a critical bottleneck in multi-agent LLM applications. These applications, where multiple AI agents work together in synchronized rounds, suffer from massive redundancy. Each agent receives the same shared context from a central scheduler, causing identical Key-Value (KV) cache data to be stored repeatedly across all agents. TokenDance directly exploits this 'All-Gather' communication pattern to enable collective KV cache sharing across an entire round of agent execution in a single step.
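To make the redundancy concrete, here is a back-of-envelope sketch (with illustrative model dimensions, not figures from the paper): if every agent in a round caches the same shared context, total KV memory grows linearly with the number of agents, while a single collective copy stays constant.

```python
# Illustrative sketch of the redundancy in an 'All-Gather' round.
# Model dimensions below are hypothetical, not from the TokenDance paper.

def kv_bytes(tokens, layers=32, heads=32, head_dim=128, dtype_bytes=2):
    """KV cache size: K and V (factor of 2) per token, per layer, fp16."""
    return tokens * layers * 2 * heads * head_dim * dtype_bytes

shared_tokens = 4096   # tokens of shared context broadcast to every agent
agents = 100           # agents executing in one synchronized round

per_agent = kv_bytes(shared_tokens)   # ~2 GiB per agent at these dimensions
naive_total = agents * per_agent      # every agent caches an identical copy
shared_total = per_agent              # one collective copy, shared by all
```

Under these assumed dimensions the naive layout stores 100 identical ~2 GiB copies; collective sharing keeps one.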
TokenDance's core innovation is its 'Diff-Aware Storage' engine. Instead of storing complete, duplicate KV caches for each agent, the system maintains a single master copy of shared context blocks. For each agent, it then stores only the sparse differences (diffs) from this master, achieving 11-17x compression on representative workloads. Evaluations on benchmarks like GenerativeAgents and AgentSociety show concrete results: TokenDance supports up to 2.7x more concurrent agents than the popular vLLM framework (with prefix caching) under the same service-level objective, reduces per-agent KV cache storage by up to 17.5x, and speeds up the initial 'prefill' phase of processing by up to 1.9x compared to other caching methods.
- Enables collective KV cache sharing for multi-agent LLMs, paying the reuse cost once per round instead of per agent.
- Uses 'Diff-Aware Storage' to encode caches as block-sparse diffs, achieving 11-17x compression on representative workloads.
- Benchmarks show up to 2.7x more concurrent agents than vLLM, up to 17.5x less per-agent KV cache storage, and up to 1.9x faster prefill.
Why It Matters
This makes complex, multi-agent AI applications far more scalable and cost-effective to deploy.