BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS
New research cuts AI inference energy by up to 48% while maintaining performance on models like Llama 3.3 70B.
A new research paper titled 'BiScale: Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS' introduces a breakthrough approach to reducing the massive energy consumption of large language model inference. Developed by researchers including Omar Basit and Yunzhao Liu, BiScale tackles the growing problem of AI's energy footprint by optimizing how LLM serving systems use GPU resources across different inference phases.
The framework employs a hierarchical two-tier control system. At coarse timescales, it jointly determines where prefill and decode instances are placed and what baseline GPU frequencies they run at, minimizing energy while satisfying service-level objectives (SLOs) for Time to First Token (TTFT) and Time Per Output Token (TPOT). At fine timescales, it adjusts GPU frequency on a per-iteration basis using phase-specific controllers: model predictive control (MPC) for the compute-intensive prefill phase, which must account for queue evolution, and lightweight slack-aware adaptation for the memory-bound decode phase.
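To make the fine-timescale tier concrete, here is a minimal Python sketch of what per-iteration, phase-aware frequency control could look like. Everything in it is an assumption for illustration: the frequency list, the margins, and the names `prefill_mpc_step`, `decode_slack_step`, and `latency_model` are hypothetical stand-ins, not BiScale's actual controllers, whose MPC formulation is more sophisticated than this simple horizon check.

```python
# Illustrative sketch only: hypothetical per-iteration DVFS controllers in the
# spirit of BiScale's fine-timescale tier. The frequency set, margins, and
# latency model are invented for illustration, not taken from the paper.

FREQS_MHZ = [900, 1100, 1300, 1500, 1700, 1980]  # assumed discrete GPU clocks

def prefill_mpc_step(predicted_queue_lens, latency_model, ttft_slo_s):
    """Pick the lowest clock whose predicted worst-case TTFT over a short
    horizon of forecast queue lengths stays within the SLO (MPC-style)."""
    for f in FREQS_MHZ:  # ascending: prefer the most energy-efficient clock
        per_req = latency_model(f)  # assumed per-request prefill latency at f
        # TTFT of the last queued request ~= queue depth * per-request latency
        if all(q * per_req <= ttft_slo_s for q in predicted_queue_lens):
            return f
    return FREQS_MHZ[-1]  # no feasible clock: fall back to max frequency

def decode_slack_step(current_freq, observed_tpot_s, tpot_slo_s,
                      up_margin=0.95, down_margin=0.70):
    """Lightweight slack-aware adaptation for the memory-bound decode phase:
    step the clock up when slack is nearly gone, down when slack is ample."""
    i = FREQS_MHZ.index(current_freq)
    if observed_tpot_s > up_margin * tpot_slo_s and i < len(FREQS_MHZ) - 1:
        return FREQS_MHZ[i + 1]   # little slack left: raise frequency
    if observed_tpot_s < down_margin * tpot_slo_s and i > 0:
        return FREQS_MHZ[i - 1]   # ample slack: lower frequency to save energy
    return current_freq
```

In a real deployment the chosen clock would be applied each iteration through NVML's locked-clocks interface or `nvidia-smi -lgc`. The asymmetry between the two functions mirrors the paper's framing: prefill needs predictive control over queue evolution, while decode only needs cheap slack tracking against its TPOT budget.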
This approach addresses key challenges in disaggregated LLM serving, where prefill and decode operations run on separate hardware. Traditional autoscaling is too slow to track rapid workload fluctuations, while applying fine-grained DVFS is complicated by phase-asymmetric dynamics and coupling between resource provisioning and frequency control. BiScale's coordinated optimization across these timescales enables significant energy savings without compromising performance.
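The coupling is easy to see in miniature: a lower baseline frequency saves power per GPU but may force the planner to provision more GPUs to stay within SLO, so the two decisions cannot be made independently. Below is a deliberately naive brute-force sketch of such a joint search; `plan_tier`, `power_model`, and `slo_ok` are hypothetical placeholders for the predictive models a real planner, including BiScale's coarse-tier optimizer, would use.

```python
# Illustrative sketch only: a brute-force stand-in for coarse-timescale
# co-optimization of provisioning and baseline frequency. power_model and
# slo_ok are hypothetical predictive models, not part of BiScale's API.

from itertools import product

def plan_tier(gpu_counts, freqs_mhz, power_model, slo_ok, load_rps):
    """Jointly pick GPU count and baseline frequency: among configurations
    predicted to meet the SLOs at the forecast load, choose minimum power."""
    best, best_power = None, float("inf")
    for n, f in product(gpu_counts, freqs_mhz):
        if not slo_ok(n, f, load_rps):   # predicted TTFT/TPOT within SLO?
            continue
        p = n * power_model(f)           # cluster power at this operating point
        if p < best_power:
            best, best_power = (n, f), p
    return best  # None if no configuration can meet the SLOs
```

Even this toy version shows why frequency-unaware autoscaling can be suboptimal: the minimum-energy configuration is sometimes more GPUs at a lower clock rather than fewer GPUs at maximum frequency.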
In evaluations on a 16x NVIDIA H100 GPU cluster serving Llama 3.3 70B with production-style traces, BiScale demonstrated remarkable efficiency gains. It achieved energy reductions of up to 39% during prefill and 48% during decode phases compared to DistServe, a state-of-the-art disaggregated serving system, while consistently meeting strict latency SLOs. This represents a major advancement toward sustainable AI infrastructure as model deployments continue to scale.
- BiScale reduces LLM inference energy by up to 39% in the prefill phase and up to 48% in the decode phase compared to DistServe
- Uses two-tier optimization with phase-aware placement and dynamic voltage/frequency scaling (DVFS) across GPU clusters
- Tested on 16x H100 cluster running Llama 3.3 70B while maintaining strict TTFT/TPOT latency SLOs
Why It Matters
Dramatically reduces AI's energy costs and carbon footprint as LLM deployments scale globally, making sustainable AI infrastructure more attainable.