Research & Papers

PCR: A Prefetch-Enhanced Cache Reuse System for Low-Latency RAG Serving

A new system tackles the latency bottleneck in RAG serving by prefetching and reusing cached key-value (KV) data across requests.

Deep Dive

A team of researchers from Shanghai Jiao Tong University and other institutions has introduced PCR, a system designed to slash latency in Retrieval-Augmented Generation (RAG) serving. RAG systems enhance LLMs by pulling in external documents, but those documents inflate requests into very long input sequences. The major bottleneck is the 'prefill' stage, where the system must compute key-value (KV) representations for every token in the context, so its cost grows with context length. PCR attacks this by maximizing the reuse of previously computed KV caches across requests that share overlapping context, such as a common system prompt or frequently retrieved documents, avoiding redundant and expensive computation.
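To make the reuse step concrete, here is a minimal Python sketch of cross-request KV-cache sharing. Every name in it (KVCacheStore, prefill, the byte-blob stand-in for KV tensors) is an illustrative assumption rather than PCR's actual interface; the point is simply that only the uncached suffix of a request's context has to pay for prefill.

```python
# Illustrative sketch of cross-request KV-cache reuse; names and the
# byte-blob KV representation are assumptions, not PCR's real API.
from typing import Optional

class KVCacheStore:
    """Toy store mapping token prefixes to precomputed KV data."""

    def __init__(self) -> None:
        self._cache: dict[tuple[int, ...], bytes] = {}

    def put(self, tokens: list[int], kv_blob: bytes) -> None:
        self._cache[tuple(tokens)] = kv_blob

    def longest_prefix(self, tokens: list[int]) -> Optional[tuple[int, bytes]]:
        # Linear scan for clarity; PCR organizes this lookup as a prefix tree.
        best: Optional[tuple[int, bytes]] = None
        for prefix, blob in self._cache.items():
            n = len(prefix)
            if tuple(tokens[:n]) == prefix and (best is None or n > best[0]):
                best = (n, blob)
        return best

def prefill(tokens: list[int], store: KVCacheStore) -> None:
    hit = store.longest_prefix(tokens)
    reused = hit[0] if hit else 0
    # Only the uncached suffix pays for the expensive attention prefill.
    print(f"reused KV for {reused} tokens, computing {len(tokens) - reused} fresh")
```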

PCR's innovation lies in overcoming three practical limits of naive cache reuse: low hit rates, slow data transfer, and sluggish storage I/O. It does this with three core techniques. First, a prefix-tree cache structure uses a 'look-ahead LRU' policy, examining pending requests in the queue to make smarter eviction decisions. Second, 'layer-wise overlapping' pipelines the loading of KV caches with GPU computation across different CUDA streams, hiding communication delays. Third, 'queue-based prefetching' proactively pulls relevant caches from SSD storage into faster DRAM before they are needed by the model.
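The first technique, look-ahead LRU, is the easiest to sketch. Below is a toy Python version; pending_prefixes (the set of cache keys that queued requests will touch) and the OrderedDict layout are assumptions for illustration, not the paper's data structures.

```python
# Toy look-ahead LRU eviction: plain LRU evicts the oldest entry, but
# here we first skip anything a queued request is about to reuse.
from collections import OrderedDict

def evict_one(cache: "OrderedDict[str, bytes]", pending_prefixes: set) -> None:
    # Assumes entries are moved to the end on each hit, so iteration
    # order is least-recently-used first.
    for key in cache:
        if key not in pending_prefixes:
            del cache[key]  # safe to evict: no pending request needs it
            return
    # Every cached entry is about to be reused; fall back to plain LRU.
    cache.popitem(last=False)
```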
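The second technique, layer-wise overlapping, is a scheduling trick, so a PyTorch-flavored sketch helps show its shape. Everything here (layers, kv_host_blocks, the per-layer event bookkeeping) is assumed for illustration, and it presumes the host-side blocks sit in pinned memory so the copies can run asynchronously.

```python
# Sketch of layer-wise overlap: while layer i computes on the default
# stream, layer i+1's KV cache is copied host-to-device on a side
# stream, hiding transfer time behind compute. Assumes pinned host
# tensors and callable per-layer modules; not PCR's implementation.
import torch

copy_stream = torch.cuda.Stream()

def prefill_with_overlap(layers, kv_host_blocks):
    n = len(layers)
    kv_dev = [None] * n
    ready = [torch.cuda.Event() for _ in range(n)]

    def launch_copy(i):
        with torch.cuda.stream(copy_stream):
            kv_dev[i] = kv_host_blocks[i].to("cuda", non_blocking=True)
            ready[i].record(copy_stream)

    launch_copy(0)
    for i, layer in enumerate(layers):
        if i + 1 < n:
            launch_copy(i + 1)  # next layer's copy overlaps this compute
        # Wait only for layer i's copy, not the one just launched.
        torch.cuda.current_stream().wait_event(ready[i])
        layer(kv_dev[i])  # compute runs on the default stream
```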
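The third technique, queue-based prefetching, is at heart a background staging loop. A minimal sketch, assuming a hypothetical read_from_ssd loader and a plain dict standing in for the DRAM tier:

```python
# Background prefetcher: walk the pending request queue and stage each
# request's KV cache from SSD into DRAM before the scheduler reaches it.
# read_from_ssd and the dict-as-DRAM are illustrative assumptions.
import queue
import threading

def prefetcher(pending: "queue.Queue[str]", dram: dict, read_from_ssd) -> None:
    while True:
        cache_key = pending.get()  # next queued request's cache key
        if cache_key is None:      # sentinel: shut down
            break
        if cache_key not in dram:
            dram[cache_key] = read_from_ssd(cache_key)  # slow I/O off the critical path

# Usage: start the thread before serving, feed it keys as requests arrive.
# t = threading.Thread(target=prefetcher, args=(pending_q, dram_cache, load_fn), daemon=True)
# t.start()
```

The design point is the same across all three sketches: use knowledge of the queued work to move slow steps off each request's critical path.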

The results are significant: in extensive experiments, PCR outperformed existing KV-cache reuse methods, improving average Time-To-First-Token (TTFT) by up to 2.47x. TTFT measures how long a user waits before the model starts generating a response, making it critical for perceived responsiveness. For companies running high-throughput RAG applications, such as customer support chatbots or enterprise search tools, this level of speedup can dramatically reduce infrastructure costs and improve responsiveness under heavy load.

Key Points
  • Uses a 'look-ahead LRU' cache policy informed by pending requests to boost hit rates.
  • Pipelines data loading and GPU computation across CUDA streams to hide transfer latency.
  • Proactively prefetches caches from SSD to DRAM, cutting I/O wait times; together, the three techniques deliver up to a 2.47x average TTFT speedup.

Why It Matters

This directly reduces cost and latency for enterprise RAG applications, making real-time, document-aware AI assistants more scalable.