Research & Papers

HFX: Joint Design of Algorithms and Systems for Multi-SLO Serving and Fast Scaling

New system cuts NPU usage cost by up to 49.81% while hitting strict SLO targets.

Deep Dive

A new paper from a large team of researchers introduces HFX, a production-grade LLM serving system designed to tackle the dual challenge of meeting strict service-level objectives (SLOs) while minimizing computational cost under dynamic, multi-task workloads. Unlike existing approaches that rely on static scheduling policies or focus on single-task settings, HFX jointly optimizes request scheduling and elastic scaling across model replicas. The system features a scheduler that performs proactive budget estimation and prioritization to ensure SLO compliance for both new and in-flight requests, and a scaler that supports fast device-to-device (D2D) weight transfer to reduce cold-start latency.
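The paper does not spell out the scheduler's internals beyond "proactive budget estimation and prioritization," but the idea can be sketched as ordering requests by their remaining SLO slack: time left until the deadline minus the predicted time still needed. The names (`Request`, `slo_budget`, `schedule`) and the fields below are illustrative assumptions, not HFX's actual API.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Hypothetical request record; lower slack = closer to an SLO
    # violation = higher scheduling priority.
    slack: float
    rid: str = field(compare=False)
    deadline: float = field(compare=False)       # absolute SLO deadline (s)
    est_remaining: float = field(compare=False)  # predicted remaining work (s)


def slo_budget(req: Request, now: float) -> float:
    """Remaining slack: time to deadline minus predicted remaining work."""
    return (req.deadline - now) - req.est_remaining


def schedule(pending: list[Request], now: float) -> list[Request]:
    """Order new and in-flight requests together by remaining SLO slack,
    least slack first, so near-violation requests run soonest."""
    for r in pending:
        r.slack = slo_budget(r, now)
    heapq.heapify(pending)
    return [heapq.heappop(pending) for _ in range(len(pending))]
```

Treating new and in-flight requests in the same priority queue mirrors the paper's claim that the scheduler protects SLO compliance for both, rather than prioritizing admission alone.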

HFX also supports both colocated and disaggregated prefill/decode deployments, enabling adaptation to diverse workload patterns and cloud environments. In extensive experiments on multi-task workloads, HFX achieved up to 4.44x higher SLO attainment, up to 65.82% lower end-to-end latency, and up to 49.81% lower NPU usage cost compared to state-of-the-art systems. The results highlight the effectiveness of SLO-aware scheduling and scaling in practical LLM serving, providing a robust framework for cost-efficient and SLO-compliant deployments.
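To see why fast device-to-device (D2D) weight transfer matters for elastic scaling, compare a back-of-the-envelope cold-start estimate: transfer time is roughly model size divided by the transfer path's bandwidth, plus runtime initialization. The numbers below are illustrative assumptions (the paper reports no such figures), chosen only to show that copying weights from a live replica over a fast interconnect dominates loading them from shared storage.

```python
def cold_start_seconds(model_gib: float, gib_per_s: float, init_s: float) -> float:
    """Rough time to bring a new replica online: weight transfer + runtime init."""
    return model_gib / gib_per_s + init_s


# Illustrative (not measured) numbers for a hypothetical 14 GiB model:
storage_path = cold_start_seconds(14.0, 2.0, init_s=5.0)    # pull weights from shared storage
d2d_path = cold_start_seconds(14.0, 200.0, init_s=5.0)      # D2D copy from a live replica
```

Under these assumed bandwidths, the transfer component shrinks from seconds to a fraction of a second, which is the mechanism by which HFX's scaler reduces cold-start latency when adding replicas to absorb load spikes.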

Key Points
  • HFX's scheduler uses proactive budget estimation and prioritization for both new and in-flight requests.
  • The scaler supports fast device-to-device (D2D) weight transfer, cutting cold-start latency.
  • Achieved up to 4.44x higher SLO attainment, 65.82% lower latency, and 49.81% lower NPU usage cost in tests.

Why It Matters

HFX makes LLM serving cheaper and faster, crucial for scaling production AI under real-world SLO constraints.