Research & Papers

Agent Memory Below the Prompt: Persistent Q4 KV Cache for Multi-Agent LLM Inference on Edge Devices

New technique slashes AI agent startup from 15.7 seconds to near-instant by persisting memory to disk.

Deep Dive

A new research paper by Yakov Pyotr Shkolnikov tackles a critical bottleneck for deploying multi-agent AI systems on consumer hardware like laptops and phones. The core problem is that device RAM is too limited to hold the working memory (KV cache) of multiple AI agents simultaneously. On an Apple M4 Pro, only 3 agents with 8K contexts could fit in memory at standard FP16 precision, forcing constant, slow cache eviction and reloading, a process that took 15.7 seconds per agent. Shkolnikov's solution, called 'Persistent Q4 KV Cache,' fundamentally changes this by saving each agent's quantized memory state to disk for rapid restoration.
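The FP16 memory pressure is easy to see with a back-of-envelope calculation. The sketch below uses illustrative model dimensions (48 layers, 8 KV heads of dimension 128), which are assumptions for the sake of the arithmetic, not the paper's exact model configuration:

```python
# Per-agent KV cache footprint for a hypothetical 48-layer model
# with 8 KV heads of dimension 128 at an 8K context.
n_layers, n_kv_heads, head_dim, seq_len = 48, 8, 128, 8192

def kv_cache_bytes(bytes_per_elem):
    # Both keys and values are cached, hence the factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

fp16 = kv_cache_bytes(2)    # 16-bit floats: 2 bytes per element
q4 = kv_cache_bytes(0.5)    # 4-bit quantized (group-scale overhead ignored)

print(f"FP16: {fp16 / 2**30:.2f} GiB per agent")  # 1.50 GiB
print(f"Q4:   {q4 / 2**30:.2f} GiB per agent")    # 0.38 GiB, 4x smaller
```

At roughly 1.5 GiB per agent in FP16, a handful of 8K-context agents quickly exhausts the unified memory a laptop can spare for inference, which is why eviction becomes unavoidable.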

The system introduces three key components: a block pool for isolated per-agent caches, a BatchQuantizedKVCache for concurrent inference, and cross-phase context injection. By quantizing the KV cache to 4-bit (Q4) and storing it in safetensors format, the method fits 4x more agent contexts into the same memory and enables near-instant reactivation. Benchmarks on models like Gemma 3 12B show time-to-first-token improvements of 22x to 136x for contexts from 4K to 32K tokens, with minimal accuracy loss (e.g., -0.7% perplexity for Gemma). This breakthrough, now open-sourced, paves the way for complex, persistent AI assistants and workflows to run entirely on-device, unlocking new applications in personal computing and mobile AI.
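The 4x capacity gain comes from storing each cache element in 4 bits instead of 16. A minimal NumPy sketch of per-group symmetric Q4 quantization with two values packed per byte is shown below; the group size, symmetric rounding, and packing layout here are assumptions for illustration, and the paper's actual Q4 scheme and safetensors serialization may differ:

```python
import numpy as np

def quantize_q4(x, group_size=32):
    # Per-group symmetric 4-bit quantization (illustrative scheme).
    x = x.reshape(-1, group_size)
    scale = np.abs(x).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    # Pack two 4-bit codes per byte: element storage drops 4x vs FP16.
    qu = (q & 0x0F).astype(np.uint8)
    packed = qu[:, ::2] | (qu[:, 1::2] << 4)
    return packed, scale

def dequantize_q4(packed, scale, group_size=32):
    # Unpack the low and high nibbles and sign-extend back to [-8, 7].
    lo = (packed & 0x0F).astype(np.int8)
    hi = ((packed >> 4) & 0x0F).astype(np.int8)
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    q = np.empty((packed.shape[0], group_size), dtype=np.int8)
    q[:, ::2] = lo
    q[:, 1::2] = hi
    return q * scale

# Round-trip a cache slice: packed codes plus per-group scales.
x = np.random.default_rng(0).standard_normal(8192).astype(np.float32)
packed, scale = quantize_q4(x)
restored = dequantize_q4(packed, scale).reshape(-1)
print(packed.nbytes, "bytes packed vs", x.astype(np.float16).nbytes, "bytes FP16")
```

In the real system the packed tensors and scales would be written to disk (the paper uses the safetensors format) so an evicted agent's context can be restored by a fast sequential read instead of a full prefill recompute.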

Key Points
  • Cuts agent startup time by up to 136x by restoring quantized KV cache from disk instead of recomputing.
  • Enables 4x more AI agents in fixed device memory using 4-bit quantization versus standard FP16 precision.
  • Open-source system tested on Gemma 3, DeepSeek-Coder, and Llama 3.1 with less than 3% perplexity impact.

Why It Matters

Enables complex, multi-step AI workflows with persistent memory to run locally on laptops and phones, reducing cloud dependency.