Research & Papers

Serving Compound Inference Systems on Datacenter GPUs

New serving system uses GPU spatial partitioning to handle 11.3x more requests while cutting resource use by over 50%.

Deep Dive

A team of researchers from the University of Illinois Urbana-Champaign, led by Sriram Devata, Rahul Singh, and Sarita Adve, has introduced JigsawServe, a framework designed to tackle the challenge of serving compound AI inference systems. These systems, which are becoming essential for applications like extended reality (XR), chain multiple ML models (such as vision, language, and audio models) into a single task graph to fulfill a user request. JigsawServe is the first system to jointly optimize the three critical axes of serving: end-to-end latency, model accuracy, and GPU resource cost. It does this by making two adaptive decisions per request: choosing the most appropriate variant of each model in the graph (trading speed against accuracy) and spatially partitioning GPU resources at fine granularity so that multiple model tasks can run concurrently.
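
The shape of that per-request decision is easiest to see in code. The sketch below is a deliberately simplified illustration, not JigsawServe's actual algorithm: the names (`Variant`, `plan_request`), the candidate GPU slice sizes, and the assumption that latency scales inversely with a model's compute slice are all ours. It greedily picks, for each model in the task graph, the fastest variant that meets its accuracy SLO and then the smallest GPU partition that still fits the remaining end-to-end latency budget.

```python
from dataclasses import dataclass

# Hypothetical sketch of joint variant selection + GPU spatial
# partitioning for a compound inference task graph. Names, numbers,
# and the latency model are illustrative assumptions.

@dataclass
class Variant:
    name: str
    accuracy: float             # expected task accuracy of this variant
    latency_full_gpu_ms: float  # profiled latency with the whole GPU

def latency_ms(v: Variant, gpu_fraction: float) -> float:
    # Crude assumption: runtime scales inversely with the compute slice.
    return v.latency_full_gpu_ms / gpu_fraction

def plan_request(graph, accuracy_slos, latency_slo_ms):
    """Greedy sketch: walk the graph in topological order and, per model,
    pick the fastest variant meeting its accuracy SLO, then the smallest
    GPU slice that keeps the end-to-end latency budget intact (smaller
    slices leave more of the GPU for concurrent tasks)."""
    plan, elapsed_ms = [], 0.0
    for model_name, variants in graph:
        choice = None
        for v in sorted(variants, key=lambda v: v.latency_full_gpu_ms):
            if v.accuracy < accuracy_slos.get(model_name, 0.0):
                continue  # variant too inaccurate for this request
            remaining = latency_slo_ms - elapsed_ms
            for frac in (0.25, 0.5, 0.75, 1.0):  # candidate GPU slices
                if latency_ms(v, frac) <= remaining:
                    choice = (v, frac)
                    break
            if choice:
                break
        if choice is None:
            return None  # infeasible under these SLOs
        v, frac = choice
        plan.append((model_name, v.name, frac))
        elapsed_ms += latency_ms(v, frac)
    return plan, elapsed_ms

if __name__ == "__main__":
    graph = [
        ("vision", [Variant("det-small", 0.88, 8.0),
                    Variant("det-large", 0.93, 20.0)]),
        ("language", [Variant("lm-7b", 0.90, 30.0),
                      Variant("lm-13b", 0.94, 55.0)]),
    ]
    print(plan_request(graph, {"vision": 0.90, "language": 0.90}, 120.0))
```

Even this toy version shows why the decisions must be made jointly: a cheaper variant frees latency budget that lets another task in the graph run on a smaller GPU slice, and vice versa.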

Empirical results demonstrate JigsawServe's efficiency gains. In a large-scale analytical evaluation, it increased the maximum serviceable request rate by 11.3x over the closest prior work. In practical tests, it consumed only 43.3% of the available GPU resources while meeting accuracy service-level objectives (SLOs) and keeping latency SLO violations below 0.6%. The research also confirms that all three of its core features (accuracy scaling, GPU spatial partitioning, and task-graph-informed resource budgeting) are essential: removing any one significantly reduces efficiency. The work is set to be presented at the 5th HCDS workshop, co-located with ASPLOS 2026.
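
The paper's exact budgeting policy isn't spelled out here, but "task-graph-informed resource budgeting" can be as simple as splitting the end-to-end latency SLO across the stages on the graph's critical path in proportion to their profiled latencies. The helper below is our own minimal illustration of that idea, with hypothetical stage names and numbers.

```python
# Illustrative sketch (not JigsawServe's published policy): apportion an
# end-to-end latency SLO across critical-path stages proportionally to
# each stage's profiled full-GPU latency.

def stage_budgets(critical_path_latencies_ms, latency_slo_ms):
    total = sum(critical_path_latencies_ms.values())
    return {stage: latency_slo_ms * t / total
            for stage, t in critical_path_latencies_ms.items()}

# e.g. a vision -> language pipeline profiled at 20 ms and 60 ms,
# under a 120 ms end-to-end SLO:
print(stage_budgets({"vision": 20.0, "language": 60.0}, 120.0))
# {'vision': 30.0, 'language': 90.0}
```

Per-stage budgets like these are what make the greedy per-model choices above tractable: each model's variant and GPU slice only need to fit its own slice of the end-to-end deadline.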

Key Points
  • Increases maximum serviceable demand by 11.3x compared to prior systems.
  • Uses only 43.3% of GPU resources while meeting accuracy SLOs with <0.6% latency violations.
  • Jointly optimizes model variant selection and GPU spatial partitioning for compound inference graphs.

Why It Matters

Dramatically reduces the cost and complexity of running next-gen AI applications like XR, making them more scalable and commercially viable.