Coral: Cost-Efficient Multi-LLM Serving over Heterogeneous Cloud GPUs
A new joint optimization cuts scheduling time from hours to tens of seconds, boosting goodput by up to 2.39×.
As LLM usage fragments across countless models, cloud providers offer a wide range of mid-tier and older-generation GPUs that often deliver comparable performance per dollar to top-tier hardware but remain underutilized. Enter Coral, a new serving system designed by researchers from Carnegie Mellon University and UC Berkeley to efficiently harness these heterogeneous resources for concurrent multi-LLM deployments. Coral’s key innovation is a joint optimization of resource allocation and serving strategy across all model replicas, treating the entire cluster as a single scheduling problem rather than optimizing each model independently. This holistic approach allows Coral to adapt to shifting throughput demands and resource availability in real time.
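To make the "single scheduling problem" framing concrete, here is a minimal sketch of a joint placement search in Python. Everything in it is hypothetical: the GPU types, prices, throughput numbers, the one-GPU-per-replica assumption, and the `joint_plan` function are illustrative stand-ins, not Coral's actual formulation or API. The point it shows is that replica counts for all models are chosen together, so models can trade GPUs against each other instead of being optimized independently.

```python
from itertools import product

# Toy data (all hypothetical): per-replica throughput (req/s) and hourly
# cost for each (model, gpu_type) pair, plus GPU supply and demand.
GPU_SUPPLY = {"A100": 4, "L4": 8}                 # GPUs available per type
GPU_COST   = {"A100": 3.67, "L4": 0.71}           # $/GPU-hour (illustrative)
THROUGHPUT = {                                    # req/s per replica
    ("llama-8b",   "A100"): 40, ("llama-8b",   "L4"): 12,
    ("mistral-7b", "A100"): 45, ("mistral-7b", "L4"): 14,
}
DEMAND = {"llama-8b": 60, "mistral-7b": 30}       # required req/s per model

def joint_plan():
    """Search replica placements for ALL models at once, so the models
    compete for the same GPU pool -- the 'single scheduling problem'."""
    pairs = list(THROUGHPUT)
    best_cost, best_plan = float("inf"), None
    # counts[i] = replicas of pairs[i], bounded by that GPU type's supply
    ranges = [range(GPU_SUPPLY[g] + 1) for _, g in pairs]
    for counts in product(*ranges):
        used = {g: 0 for g in GPU_SUPPLY}
        served = {m: 0.0 for m in DEMAND}
        for (m, g), n in zip(pairs, counts):
            used[g] += n                          # 1 GPU per replica (toy)
            served[m] += n * THROUGHPUT[(m, g)]
        if any(used[g] > GPU_SUPPLY[g] for g in GPU_SUPPLY):
            continue                              # exceeds cluster supply
        if any(served[m] < DEMAND[m] for m in DEMAND):
            continue                              # misses a demand target
        cost = sum(used[g] * GPU_COST[g] for g in GPU_SUPPLY)
        if cost < best_cost:
            best_cost, best_plan = cost, dict(zip(pairs, counts))
    return best_cost, best_plan

print(joint_plan())
```

Even in this toy, the cheapest feasible plan often mixes mid-tier L4s with A100s, which is exactly the performance-per-dollar opportunity the article describes. A brute-force search like this, however, blows up combinatorially at cluster scale, which motivates the decomposition below.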
To make this joint optimization practical, Coral applies a lossless two-stage decomposition that preserves optimality while slashing solve time from hours to just tens of seconds. In evaluations across six popular LLMs and 20 distinct GPU configurations, Coral reduced serving costs by up to 2.79× relative to the best baseline systems (including state-of-the-art schedulers like Shepherd and MUX). Under scarce resource conditions, Coral delivered up to 2.39× higher goodput, i.e., more requests completed on time. The system is particularly valuable for organizations running multiple open-weight LLMs (e.g., Llama 3, Mistral) on spot or preemptible cloud instances, where resource availability fluctuates. Coral's source code and benchmarks are expected to be released, enabling cloud operators to dramatically lower inference costs without sacrificing performance.
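The general shape of such a two-stage decomposition can be sketched as follows; this is an illustrative pattern under stated assumptions, not Coral's actual algorithm. Stage 1 solves a per-model subproblem: for each possible GPU grant, find the best serving configuration and record the resulting goodput (the `best_config_goodput` stub below stands in for a real profiler or simulator). Stage 2 then splits the cluster across models using only those precomputed optima. Because stage 1 returns the exact per-model optimum for every candidate grant, the outer search loses nothing, which is what "lossless" means here.

```python
from functools import lru_cache

# Hypothetical stage 1: for one model on a fixed GPU grant, find the best
# serving config (parallelism, batching, ...). A real system would profile
# or simulate; this stub assumes linear scaling (a toy assumption).
def best_config_goodput(model, n_gpus):
    base = {"llama-8b": 12, "mistral-7b": 14}[model]  # req/s per GPU (toy)
    return base * n_gpus

MODELS = ["llama-8b", "mistral-7b"]
TOTAL_GPUS = 8

@lru_cache(maxsize=None)
def allocate(i, gpus_left):
    """Stage 2: split the remaining GPUs among MODELS[i:], reusing the
    stage-1 optima. Memoization makes this a dynamic program over
    (model index, remaining GPUs) instead of a joint combinatorial search."""
    if i == len(MODELS):
        return 0.0, ()
    best = (float("-inf"), ())
    for n in range(gpus_left + 1):
        rest_val, rest_plan = allocate(i + 1, gpus_left - n)
        total = best_config_goodput(MODELS[i], n) + rest_val
        if total > best[0]:
            best = (total, ((MODELS[i], n),) + rest_plan)
    return best

print(allocate(0, TOTAL_GPUS))
```

The design payoff is that the expensive inner subproblems can be solved (or re-solved) independently and cached, so when demand or spot availability shifts, only the fast outer allocation needs to rerun, consistent with the seconds-scale online adaptation the article reports.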
- Coral jointly optimizes resource allocation and serving strategy across all model replicas, achieving up to 2.79× cost reduction versus baselines.
- It uses a lossless two-stage decomposition that cuts online solve time from hours to tens of seconds, enabling real-time adaptation.
- Evaluated on 6 LLMs and 20 GPU configurations, Coral delivered up to 2.39× higher goodput when resources were scarce.
Why It Matters
Coral makes multi-LLM serving drastically cheaper and more reliable on spot/preemptible cloud GPUs, lowering barriers to production AI.