Research & Papers

Serving Hybrid LLM Loads with SLO Guarantees Using CPU-GPU Attention Piggybacking

New system offloads attention computation to CPUs, freeing GPUs to improve SLO attainment for latency-sensitive requests by up to 1.48x.

Deep Dive

A research team led by Zizhao Mo has introduced OmniServe, a system for managing hybrid AI workloads in shared computing clusters. Its core innovation is an 'Attention Piggybacking' mechanism that dynamically offloads the computationally intensive attention operations of best-effort (BE) services, such as batch processing or other non-urgent tasks, to available CPU resources. This keeps BE tasks from monopolizing GPU memory and interfering with latency-sensitive (LS) services, such as real-time chatbots, which carry strict Service Level Objective (SLO) targets for response time. Because communication between the CPU and GPU streams is asynchronous, the GPU never blocks waiting for CPU results, which substantially improves overall hardware utilization.
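To make the mechanism concrete, here is a minimal sketch of what such an offload path could look like, assuming a PyTorch runtime with CUDA streams. The helper names (cpu_attention, piggyback_decode_step, copy_stream) are illustrative assumptions, not the paper's API, and a production scheduler would be considerably more involved:

```python
import torch

def cpu_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention computed entirely on the CPU.

    q:    (batch, heads, 1, d)    single decode-step query
    k, v: (batch, heads, seq, d)  KV cache kept in host memory
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale
    return torch.matmul(torch.softmax(scores, dim=-1), v)

def piggyback_decode_step(q_gpu, kv_cache_cpu, copy_stream):
    """Offload one BE attention step to the CPU without blocking the GPU.

    The small query tensor is copied device-to-host on a side stream, so
    the default stream stays free to run latency-sensitive (LS) batches
    while the CPU computes attention over the host-resident KV cache.
    (In practice the host buffers would be pinned so copies truly overlap.)
    """
    k_cpu, v_cpu = kv_cache_cpu
    with torch.cuda.stream(copy_stream):
        q_cpu = q_gpu.to("cpu", non_blocking=True)
    copy_stream.synchronize()  # wait only for the tiny query copy, not the GPU
    out_cpu = cpu_attention(q_cpu, k_cpu, v_cpu)
    # Ship the attention output back; the dense layers stay on the GPU.
    return out_cpu.to("cuda", non_blocking=True)

# Usage (requires a CUDA device):
#   copy_stream = torch.cuda.Stream()
#   out = piggyback_decode_step(q_gpu, (k_cpu, v_cpu), copy_stream)
```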

OmniServe also implements a dynamic batching control policy that adapts to fluctuating request patterns, optimizing the 'Dense module' computations through layer-wise batching. In the reported evaluation, the system improved the SLO attainment rate of latency-sensitive services by up to 1.48x while raising best-effort serving throughput by up to 9.85x, nearly an order of magnitude, over state-of-the-art serving systems such as vLLM and TGI. Together, the results point to a real gain in data-center efficiency: providers can host a more diverse mix of AI models on the same infrastructure without performance penalties for their critical services.
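The batching controller is also easy to picture. The paper does not spell out its policy, so the sketch below is an assumption: an AIMD-style loop (the class name, thresholds, and limits are all hypothetical) that grows the BE batch while LS latency has headroom and backs off as it nears the SLO:

```python
class BatchController:
    """Illustrative per-step controller for the best-effort (BE) batch size."""

    def __init__(self, slo_ms: float, min_batch: int = 1, max_batch: int = 64):
        self.slo_ms = slo_ms
        self.min_batch = min_batch
        self.max_batch = max_batch
        self.batch_size = min_batch

    def update(self, ls_latency_ms: float) -> int:
        """Adapt the BE batch size to the latest LS latency sample."""
        if ls_latency_ms > 0.9 * self.slo_ms:
            # LS requests are close to violating their SLO: back off fast.
            self.batch_size = max(self.min_batch, self.batch_size // 2)
        else:
            # Headroom available: reclaim idle capacity gradually.
            self.batch_size = min(self.max_batch, self.batch_size + 1)
        return self.batch_size

# Usage: feed in each observed LS latency, then form the next BE batch
# with the returned size.
#   controller = BatchController(slo_ms=200.0)
#   next_size = controller.update(ls_latency_ms=120.0)
```

Multiplicative back-off protects the LS SLO quickly, while additive growth reclaims spare capacity slowly, the same trade-off classic AIMD congestion control makes.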

Key Points
  • Uses 'Attention Piggybacking' to offload BE service computations to CPUs, preventing GPU blocking and memory contention.
  • Improves SLO attainment for latency-sensitive services by up to 1.48x and boosts best-effort service throughput by up to 9.85x.
  • Features dynamic batching to adapt to request fluctuations, optimizing layer-wise computation for improved efficiency.

Why It Matters

Enables cloud providers to run more AI services cost-effectively on shared hardware while guaranteeing performance for critical applications.