Research & Papers

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

A new architecture decouples AI prefill and decode, enabling cross-datacenter serving with commodity networks.

Deep Dive

A research team from Tsinghua University and other institutions has introduced Prefill-as-a-Service (PrfaaS), a groundbreaking architecture designed to overcome a major bottleneck in large-scale AI model serving. Current systems using prefill-decode (PD) disaggregation are limited because they must transfer massive amounts of KVCache—the attention key/value states computed for the prompt during the initial 'prefill' phase—between prefill and decode instances. This forces the two phases to be tightly coupled within a single data center using expensive, high-bandwidth networks (such as RDMA), restricting deployment flexibility and scaling.
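To see why the transfer is the bottleneck, the KVCache footprint of a conventional full-attention model can be estimated from its shape. The sketch below uses the standard sizing formula with illustrative model dimensions that are our assumptions, not figures from the paper:

```python
def kvcache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Estimate full-attention KVCache size for one request.

    The factor of 2 accounts for storing both keys and values at every layer.
    """
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes

# Illustrative (hypothetical) configuration: a 128K-token prompt on a model
# with 60 layers, 8 KV heads of dimension 128, FP16 cache entries.
size_gib = kvcache_bytes(60, 8, 128, 128 * 1024) / 2**30  # 30.0 GiB
```

Tens of gibibytes per long-context request is what makes shipping the cache between loosely connected clusters impractical without the smaller caches PrfaaS relies on.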

PrfaaS leverages newer, more efficient 'hybrid-attention' model architectures, which interleave full attention with cheaper attention variants and therefore produce a much smaller KVCache. The key innovation, however, is a system that does more than rely on the smaller cache size: it intelligently offloads the computationally intensive prefill work for long-context requests to dedicated, compute-optimized clusters. The resulting compact KVCache is then transferred over standard, cost-effective Ethernet to separate clusters that handle the token-by-token 'decode' phase of text generation. This decoupling allows each type of workload to be scaled independently on the most suitable hardware.
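A back-of-the-envelope comparison shows why a smaller cache changes the network requirements. The ratio and link speeds below are illustrative assumptions, not the paper's measurements:

```python
def transfer_seconds(cache_gib, link_gbps, efficiency=0.8):
    """Time to ship a KVCache of `cache_gib` GiB over a `link_gbps` Gbit/s link,
    assuming a fixed achievable-throughput efficiency."""
    bits = cache_gib * 2**30 * 8
    return bits / (link_gbps * 1e9 * efficiency)

# Hypothetical sizes: a dense-attention cache vs. a hybrid-attention cache an
# order of magnitude smaller (illustrative ratio), both over 25 GbE.
dense_s = transfer_seconds(30.0, 25)   # ~12.9 s: too slow to hide per request
hybrid_s = transfer_seconds(3.0, 25)   # ~1.3 s: plausible over commodity Ethernet
```

The same arithmetic explains the original constraint in reverse: with a dense cache, only an RDMA-class fabric keeps transfer latency off the critical path.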

To make this practical for real-world, variable workloads, PrfaaS incorporates smart scheduling. It uses bandwidth-aware scheduling to manage traffic between data centers and cache-aware request placement to optimize performance. This system design removes the strict requirement for a unified, low-latency network fabric, enabling a truly heterogeneous and elastic infrastructure. The researchers demonstrated its efficacy in a case study using a massive 1-trillion-parameter hybrid model, where PrfaaS outperformed both traditional homogeneous deployments and naive offloading approaches.
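The scheduling ideas above can be sketched as a toy placement policy. This is our own minimal sketch of bandwidth- and cache-aware placement, not the paper's algorithm; the cost model, field names, and constants are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class PrefillCluster:
    name: str
    free_gbps: float           # spare cross-datacenter bandwidth to this cluster
    cached_prefix_tokens: int  # longest request prefix already resident in its cache

def place_request(clusters, cache_bytes_per_token, prompt_tokens, deadline_s):
    """Pick the prefill cluster minimizing estimated recompute + transfer time.

    Cached prefix tokens need no prefill compute (cache-awareness), and the
    resulting KVCache must ship back within the link budget (bandwidth-awareness).
    Returns None if no cluster can meet the deadline.
    """
    best, best_cost = None, float("inf")
    for c in clusters:
        new_tokens = max(prompt_tokens - c.cached_prefix_tokens, 0)
        compute_s = new_tokens * 1e-4  # assumed per-token prefill cost (seconds)
        xfer_s = (prompt_tokens * cache_bytes_per_token * 8) / (c.free_gbps * 1e9)
        cost = compute_s + xfer_s
        if cost < best_cost and cost <= deadline_s:
            best, best_cost = c, cost
    return best
```

Under this model, a cluster holding a long cached prefix can win placement even over a better-connected one, because skipped prefill compute outweighs the slower transfer.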

Key Points
  • Decouples prefill & decode: Offloads compute-heavy prefill to specialized clusters, sending only the compact KVCache over Ethernet.
  • Boosts throughput by 54%: In a test with a 1T-parameter model, it significantly outperformed standard serving architectures.
  • Enables flexible, cost-effective scaling: Allows prefill and decode capacity to be scaled independently across loosely coupled data centers.

Why It Matters

This could drastically reduce the cost and complexity of deploying massive AI models, making advanced LLM services more scalable and accessible.