Robotics

KEEP: A KV-Cache-Centric Memory Management System for Efficient Embodied Planning

A new memory management system speeds up LLM-based embodied planning by up to 2.68x while improving task success rates.

Deep Dive

A research team led by Zebin Yang has introduced KEEP, a novel memory management system designed to dramatically improve the efficiency of Large Language Models (LLMs) in embodied planning tasks for robotics. The system addresses a critical bottleneck: current approaches that store memory as raw text create excessively long prompts and high prefill latency, while methods that reuse KV caches suffer from frequent recomputation overhead. KEEP's breakthrough comes from three key innovations that optimize how AI agents maintain and access their memory of past experiences and environmental states during complex, multi-step tasks.
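The bottleneck described above can be made concrete with a toy token count (this is an illustrative sketch, not KEEP's implementation; the per-step and per-query token counts are made-up numbers). With raw-text memory, every planning step re-prefills the entire accumulated history, so prefill cost grows quadratically with the number of steps; reusing the KV cache for already-processed context leaves only the newly appended tokens to prefill.

```python
# Illustrative sketch (not KEEP's implementation): why raw-text memory
# inflates prefill cost over a multi-step episode, and how KV-cache reuse
# avoids reprocessing past context. Token counts are hypothetical.

STEP_TOKENS = 200   # hypothetical tokens appended to memory per planning step
QUERY_TOKENS = 50   # hypothetical tokens in each new planning query

def prefill_tokens_text_memory(num_steps: int) -> int:
    """Text-based memory: step k re-prefills all k prior steps plus the query."""
    return sum(step * STEP_TOKENS + QUERY_TOKENS for step in range(num_steps))

def prefill_tokens_kv_reuse(num_steps: int) -> int:
    """KV-cache reuse: prior steps' KV entries are kept in cache, so each
    step only prefills the newly appended tokens plus the query."""
    return sum(STEP_TOKENS + QUERY_TOKENS for _ in range(num_steps))

# Over a 10-step episode the gap is already large:
# text memory prefills 9,500 tokens in total, KV reuse only 2,500.
```

Naive KV reuse is not free, though: when cached segments are recombined in a new order or interleaved with fresh content, their cross-attention is stale, which is exactly the recomputation overhead KEEP targets.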

KEEP implements a Static-Dynamic Memory Construction algorithm that reduces KV cache recomputation through mixed-granularity memory grouping, a Multi-hop Memory Re-computation algorithm that dynamically identifies important cross-attention patterns, and Layer-balanced Memory Loading that evens out computational load across model layers. Experimental results on the ALFRED benchmark show KEEP delivers a 2.68x speedup over text-based memory methods while maintaining accuracy, and outperforms the state-of-the-art KV recomputation method CacheBlend with a 4.13% higher success rate and 1.90x faster initial response times. This represents a significant step toward making AI-powered robots practical for real-world applications that require extended reasoning and planning.
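The static/dynamic split behind KEEP's memory construction can be sketched roughly as follows (a hypothetical illustration only: the segment names, the `changed` flag, and the selection rule are assumptions for exposition, and the paper's actual mixed-granularity grouping criteria are more involved). The idea is that segments whose content never changes keep their KV cache verbatim, while only dynamic segments touched by new observations are flagged for recomputation.

```python
# Hypothetical sketch of a static/dynamic memory split for KV-cache reuse.
# Segment names and fields are illustrative, not KEEP's actual data model.

from dataclasses import dataclass

@dataclass
class MemorySegment:
    name: str
    static: bool   # task-independent content whose KV cache is always reusable
    changed: bool  # dynamic content updated since its KV cache was built

def segments_to_recompute(segments: list[MemorySegment]) -> list[str]:
    """Return names of segments whose KV cache must be rebuilt; static
    segments and unchanged dynamic segments are served from cache."""
    return [s.name for s in segments if not s.static and s.changed]

memory = [
    MemorySegment("system_prompt", static=True, changed=False),
    MemorySegment("skill_library", static=True, changed=False),
    MemorySegment("scene_graph", static=False, changed=True),
    MemorySegment("action_history", static=False, changed=False),
]
# Only the freshly updated dynamic segment needs its KV entries rebuilt.
print(segments_to_recompute(memory))  # → ['scene_graph']
```

Keeping the recomputed set small is what drives the reported speedups: in this framing, the multi-hop re-computation step would further prune that set to only the tokens whose cross-attention actually shifted.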

Key Points
  • Achieves 2.68x speedup over text-based memory methods on ALFRED dataset with minimal accuracy loss
  • Outperforms CacheBlend (EuroSys'25) with 4.13% higher success rate and 1.90x faster time-to-first-token
  • Uses three novel algorithms to optimize KV cache management for long-horizon planning tasks

Why It Matters

Enables more efficient and capable AI robots for complex real-world tasks by solving critical memory bottlenecks.