SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference
New hybrid model combines queuing theory with benchmarks to predict optimal compute allocation.
A team of researchers including Luchang Li, Dongfang Li, Bozhao Gong, and Yu Zhang has published an arXiv paper titled 'SLO-Aware Compute Resource Allocation for Prefill-Decode Disaggregated LLM Inference.' The work addresses a gap in large language model deployment: while Prefill-Decode (P/D) disaggregation has become a standard technique for separating initial prompt processing (prefill) from token generation (decode), there has been no established methodology for deciding how much hardware to allocate to each phase. The researchers propose a hybrid approach that combines theoretical modeling with empirical measurements to solve this allocation problem, enabling more cost-effective LLM serving while maintaining strict Service Level Objectives (SLOs) for both Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT).
Technically, the approach models the prefill phase as an M/M/1 queue, deriving the achievable request throughput from the benchmarked maximum prefill throughput and the TTFT requirement. For the decode phase, the method determines the decode batch sizes that satisfy the TPOT requirement and obtains the corresponding throughput from empirical measurements. Experimental results show that this hybrid method accurately predicts the optimal P/D resource allocation in real-world scenarios, potentially reducing inference costs by up to 30% compared with current heuristic approaches. This matters for cloud providers and AI companies running large-scale LLM inference services, as it provides a systematic way to balance performance guarantees against infrastructure costs for production deployments of models at the scale of GPT-4, Claude 3, and Llama 3.
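To make the prefill side concrete, here is a minimal sketch of how an M/M/1 model can translate a benchmarked maximum prefill rate and a TTFT target into an SLO-feasible request rate. The function name, the option to treat the TTFT target as either a mean or a tail (percentile) constraint, and the example numbers are illustrative assumptions, not the paper's exact formulation.

```python
from math import log
from typing import Optional

def prefill_slo_throughput(mu_max: float, ttft_slo: float,
                           quantile: Optional[float] = None) -> float:
    """SLO-feasible prefill request rate (req/s) under an M/M/1 model.

    mu_max   -- benchmarked maximum prefill service rate of one prefill
                instance, in requests per second
    ttft_slo -- TTFT target in seconds
    quantile -- None: treat ttft_slo as a mean-latency target;
                e.g. 0.99: treat it as a p99 (tail) target
    """
    if quantile is None:
        # M/M/1 mean sojourn time is 1/(mu - lambda); requiring it to stay
        # below TTFT gives lambda <= mu - 1/TTFT.
        lam = mu_max - 1.0 / ttft_slo
    else:
        # M/M/1 sojourn time is exponential with rate (mu - lambda), so its
        # q-quantile is -ln(1-q)/(mu - lambda); bounding that by TTFT gives
        # lambda <= mu + ln(1-q)/TTFT.
        lam = mu_max + log(1.0 - quantile) / ttft_slo
    return max(lam, 0.0)

# Example (hypothetical numbers): an instance that sustains 2.5 prefill req/s
# at saturation can admit ~1.5 req/s under a 1 s mean-TTFT target, and
# ~0.97 req/s under a 3 s p99-TTFT target.
print(prefill_slo_throughput(2.5, 1.0))        # 1.5
print(prefill_slo_throughput(2.5, 3.0, 0.99))  # ~0.965
```

The useful M/M/1 fact here is that response time is exponentially distributed with rate mu - lambda, so both mean and percentile TTFT targets translate directly into a maximum admissible arrival rate.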
- Combines queuing theory (M/M/1 model) with empirical benchmarking to predict the optimal hardware split between prefill and decode phases (a toy end-to-end sketch follows this list)
- Addresses Service Level Objectives (SLOs) for both Time-To-First-Token (TTFT) and Time-Per-Output-Token (TPOT) simultaneously
- Experimental results show accurate prediction of resource allocation, potentially reducing inference costs by up to 30% compared to heuristic methods
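As a rough end-to-end illustration, the sketch below enumerates prefill/decode GPU splits for a fixed cluster, bounds prefill capacity with the same M/M/1 relation used above, reads decode capacity from a hypothetical TPOT benchmark table, and keeps the split that sustains the highest request rate with both SLOs satisfied. The table values, function names, and the simple min() bottleneck model are assumptions made for illustration; the paper's actual allocation procedure may differ.

```python
# Hypothetical decode profile: batch size -> (measured TPOT in s/token,
# decode throughput in tokens/s per GPU). Real values come from benchmarking
# the target model on the target hardware.
DECODE_BENCH = {8: (0.012, 650), 16: (0.018, 880), 32: (0.028, 1150), 64: (0.045, 1400)}

def best_pd_split(total_gpus: int, prefill_mu_per_gpu: float,
                  ttft_slo: float, tpot_slo: float, avg_output_tokens: float):
    """Toy search over prefill/decode GPU splits: return the split that
    sustains the highest request rate while both SLOs hold."""
    # Decode side: highest benchmarked throughput whose measured TPOT meets the SLO.
    feasible = [tps for tpot, tps in DECODE_BENCH.values() if tpot <= tpot_slo]
    decode_tps_per_gpu = max(feasible, default=0.0)

    best = None
    for p_gpus in range(1, total_gpus):
        d_gpus = total_gpus - p_gpus
        # Prefill side: M/M/1 mean-latency bound, lambda <= mu - 1/TTFT, per GPU.
        prefill_rate = p_gpus * max(prefill_mu_per_gpu - 1.0 / ttft_slo, 0.0)
        # Convert decode tokens/s into completed requests/s.
        decode_rate = d_gpus * decode_tps_per_gpu / avg_output_tokens
        sustained = min(prefill_rate, decode_rate)  # the slower phase caps the system
        if best is None or sustained > best[0]:
            best = (sustained, p_gpus, d_gpus)
    return best  # (requests/s, prefill GPUs, decode GPUs)

# Example (hypothetical numbers): 8 GPUs, 2.5 prefill req/s per GPU at
# saturation, 1 s TTFT, 50 ms TPOT, 250 output tokens per request on average.
print(best_pd_split(8, 2.5, 1.0, 0.05, 250))  # -> (9.0, 6, 2)
```

A cost-aware variant would instead pick the cheapest split that still meets a target arrival rate, which is the kind of decision the paper's predictions are meant to inform.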
Why It Matters
Enables cloud providers and AI companies to serve LLMs more cost-effectively while maintaining strict performance guarantees for end-users.