Agent Frameworks

QKVShare: Quantized KV-Cache Handoff for Multi-Agent On-Device LLMs

Quantized KV-cache handoff cuts TTFT from 1029 ms to 397 ms at 8K context

Deep Dive

Multi-agent LLM systems running on edge devices face a fundamental challenge: when one agent hands off a conversation to another, it must transfer the accumulated context, in practice the KV cache, efficiently. Current approaches either re-prefill the entire context from scratch (compute-expensive) or transfer full-precision KV caches (bandwidth-heavy). QKVShare, proposed by researchers Pratik Honavar and Tejpratap GVSL, tackles this with a quantized KV-cache handoff framework that combines token-level mixed-precision allocation, a self-contained CacheCard representation, and a HuggingFace-compatible cache injection path. The key idea is that instead of transmitting raw KV values, QKVShare compresses them into a standardized format that the receiving agent can inject directly into its attention layers, bypassing the need for re-prefill.
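
To make the handoff concrete, here is a minimal sketch of the idea, not the authors' implementation: quantize a prefix KV cache per token to int8, pack it into a self-contained dict standing in for the paper's CacheCard, and rebuild a HuggingFace cache on the receiving side. The function names, the card layout, and the fixed 8-bit setting are illustrative assumptions.

```python
# Illustrative sketch only; QKVShare's actual CacheCard format and quantizer may differ.
import torch
from transformers import DynamicCache

def quantize_per_token(x: torch.Tensor, bits: int = 8):
    """Symmetric per-token quantization of a (batch, heads, seq_len, head_dim) tensor."""
    qmax = 2 ** (bits - 1) - 1
    # One scale per token position, shared across heads and head_dim.
    scale = x.abs().amax(dim=(1, 3), keepdim=True).clamp(min=1e-8) / qmax
    q = torch.round(x / scale).clamp(-qmax, qmax).to(torch.int8)
    return q, scale

def make_cache_card(legacy_cache, bits: int = 8) -> dict:
    """Pack a legacy per-layer (key, value) cache into a quantized, transferable card.

    On newer transformers versions, obtain the legacy tuples via cache.to_legacy_cache().
    """
    layers = []
    for key, value in legacy_cache:
        qk, sk = quantize_per_token(key, bits)
        qv, sv = quantize_per_token(value, bits)
        layers.append({"qk": qk, "sk": sk, "qv": qv, "sv": sv})
    return {"bits": bits, "layers": layers}

def inject_cache_card(card: dict) -> DynamicCache:
    """Dequantize the card and rebuild a cache the receiving agent can reuse."""
    legacy = tuple(
        (layer["qk"].to(layer["sk"].dtype) * layer["sk"],
         layer["qv"].to(layer["sv"].dtype) * layer["sv"])
        for layer in card["layers"]
    )
    return DynamicCache.from_legacy_cache(legacy)
```

In recent transformers releases, the rebuilt DynamicCache can be passed as past_key_values to generate() alongside the full prompt token ids, so decoding resumes from the cached prefix instead of re-running prefill; serializing the int8 tensors and scales (for example with torch.save) is what would actually cross the agent boundary.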

On 150 GSM8K reasoning problems using Llama-3.1-8B-Instruct, QKVShare demonstrated its clearest latency gains under repeated handoffs with deeper hops and higher budgets. At a nominal 1K context, time to first token (TTFT) dropped from 150.2 ms to 130.7 ms; at 8K context the gap widened dramatically, 397.1 ms versus 1029.7 ms, a 2.6× improvement. Adaptive quantization remained competitive with uniform quantization, especially in deeper handoff settings. Notably, post-injection generation (not card creation) dominates the current latency path, suggesting further optimization opportunities. The results position quantized KV handoff as a promising direction for on-device multi-agent systems, though the authors note the need for stronger controller ablations and apples-to-apples runtime comparisons.
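
The adaptive-versus-uniform comparison hinges on how bits are allocated across tokens. The toy sketch below illustrates one plausible token-level mixed-precision scheme, under assumptions not drawn from the paper: rank token positions by the norm of their key vectors and give the top fraction 8-bit precision while the rest get 4-bit, so an average bit budget is met. The scoring rule, the {4, 8}-bit menu, and the allocate_bits name are all illustrative.

```python
# Toy bit-allocation sketch; QKVShare's actual allocation policy may differ.
import torch

def allocate_bits(keys: torch.Tensor, avg_bits: float = 6.0,
                  hi_bits: int = 8, lo_bits: int = 4) -> torch.Tensor:
    """Return a per-token bit width of shape (seq_len,) for one layer's keys.

    keys: (batch, num_heads, seq_len, head_dim) key tensor from the prefill pass.
    """
    seq_len = keys.shape[2]
    # Importance score per token: total key magnitude across batch, heads, and dims.
    scores = keys.float().pow(2).sum(dim=(0, 1, 3)).sqrt()  # (seq_len,)
    # Fraction of tokens that can receive hi_bits while still hitting the average budget.
    hi_frac = (avg_bits - lo_bits) / (hi_bits - lo_bits)
    num_hi = int(round(hi_frac * seq_len))
    bits = torch.full((seq_len,), lo_bits, dtype=torch.int64)
    if num_hi > 0:
        top = torch.topk(scores, k=min(num_hi, seq_len)).indices
        bits[top] = hi_bits
    return bits
```

These per-token widths would then drive a per-token quantizer like the one sketched earlier; "adaptive" here simply means the high/low split adjusts to the content of the prefix rather than being fixed in advance.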

Key Points
  • QKVShare reduces TTFT by up to 2.6× vs full re-prefill (397 ms vs 1030 ms at 8K context) across 150 GSM8K problems with Llama-3.1-8B-Instruct
  • Uses token-level mixed-precision allocation and a CacheCard representation for standardized quantized KV handoff, compatible with HuggingFace cache injection
  • Post-injection generation, not card creation, dominates latency in current implementation, pointing to further optimization headroom

Why It Matters

Enables efficient multi-agent LLM coordination on edge devices, unlocking complex reasoning without cloud dependency.