Research & Papers

Position: LLM Serving Needs Mathematical Optimization and Algorithmic Foundations, Not Just Heuristics

vLLM and SGLang still rely on FIFO and LRU; a new position paper argues that's not enough.

Deep Dive

A position paper posted on arXiv by Zijie Zhou makes a bold claim: LLM inference serving has outgrown the generic heuristics that current systems rely on. Systems like vLLM and SGLang are state-of-the-art in practice, but their algorithmic cores still use decades-old policies—request routing with join-shortest-queue or round-robin, scheduling with FIFO, and KV cache eviction with LRU. While these heuristics work well for traditional workloads, they fail to account for the unique characteristics of LLM inference, leading to unpredictable performance across diverse scenarios.
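To make the contrast concrete, the sketch below shows what those classical policies look like in code. It is a minimal illustration with made-up names (route_join_shortest_queue, FifoScheduler, LruKvCache), not an excerpt from vLLM or SGLang.

    from collections import OrderedDict, deque

    # Illustrative sketch only: these class and function names are invented and are
    # not taken from vLLM or SGLang; they just show the shape of the classical policies.

    def route_join_shortest_queue(worker_queues):
        """Join-shortest-queue routing: send the request to the least-loaded worker."""
        return min(worker_queues, key=lambda worker: len(worker_queues[worker]))

    class FifoScheduler:
        """FIFO scheduling: serve requests strictly in arrival order."""
        def __init__(self):
            self.queue = deque()

        def submit(self, request):
            self.queue.append(request)

        def next_request(self):
            return self.queue.popleft() if self.queue else None

    class LruKvCache:
        """LRU eviction: when the block budget is exceeded, drop the least recently used entry."""
        def __init__(self, capacity_blocks):
            self.capacity = capacity_blocks
            self.blocks = OrderedDict()  # request_id -> number of KV blocks held

        def touch(self, request_id, num_blocks):
            self.blocks[request_id] = num_blocks
            self.blocks.move_to_end(request_id)       # mark as most recently used
            while sum(self.blocks.values()) > self.capacity:
                self.blocks.popitem(last=False)       # evict least recently used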

The paper identifies key structural differences: KV cache memory that grows dynamically during generation, asymmetry between prefill and decode phases, unknown output lengths, and the constraints of continuous batching. The author argues that the field must develop mathematical models that capture these traits, enabling algorithms with provable guarantees. Emerging work at the intersection of operations research and ML systems shows that principled methods can match or exceed heuristic performance. Zhou calls on the community to treat algorithmic design for LLM serving as a new research frontier, moving beyond ad-hoc fixes to a rigorous foundation.
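The toy simulation below is our own sketch of those dynamics, not a model from the paper: prefill cost is known on arrival, the KV cache grows by one token per decode step, output length is unknown to the scheduler, and a continuous-batching loop must admit requests against a shared memory budget (KV_BUDGET_TOKENS, MAX_NEW_TOKENS, and the other names are assumed values for illustration).

    import random

    KV_BUDGET_TOKENS = 4096   # assumed per-GPU KV budget for this toy model
    MAX_NEW_TOKENS = 512      # worst-case output length the naive policy reserves for

    class Seq:
        def __init__(self, prompt_len):
            self.prompt_len = prompt_len                     # prefill cost, known on arrival
            self.generated = 0                               # decode progress so far
            self.target = random.randint(1, MAX_NEW_TOKENS)  # true output length, unknown to the scheduler

        def step(self):
            self.generated += 1                   # each decode step adds one token to the KV cache
            return self.generated >= self.target  # True once the sequence finishes

    def continuous_batch_step(running, waiting):
        # Output length is unknown in advance, so this naive policy reserves worst-case
        # headroom for every admitted sequence; real systems over-commit and preempt
        # instead, which is the kind of trade-off the paper argues should be modeled.
        reserved = sum(s.prompt_len + MAX_NEW_TOKENS for s in running)
        while waiting and reserved + waiting[0].prompt_len + MAX_NEW_TOKENS <= KV_BUDGET_TOKENS:
            seq = waiting.pop(0)                  # prefill (admit) the oldest waiting request
            reserved += seq.prompt_len + MAX_NEW_TOKENS
            running.append(seq)
        # One decode step per running sequence; finished sequences leave the batch.
        running[:] = [s for s in running if not s.step()]

    running, waiting = [], [Seq(prompt_len=random.randint(16, 256)) for _ in range(8)]
    steps = 0
    while running or waiting:
        continuous_batch_step(running, waiting)
        steps += 1
    print(f"drained 8 requests in {steps} engine steps")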

Key Points
  • Current LLM serving systems (vLLM, SGLang) use generic heuristics like join-shortest-queue, FIFO, and LRU from classical distributed computing.
  • LLM inference introduces unique dynamics: growing KV cache, prefill-decode asymmetry, unknown output lengths, and continuous batching constraints.
  • The paper advocates for mathematical optimization and algorithmic foundations to provide provable performance guarantees across diverse workloads (a toy contrast with a FIFO heuristic is sketched below).
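As a deliberately simplified illustration of that last point (ours, not a formulation from the paper), the example below poses batch admission under a KV-memory budget as a 0/1 knapsack solved exactly by dynamic programming and compares it against FIFO admission; the request sizes, values, and budget are made up.

    # Made-up illustration, not the paper's formulation: batch admission under a
    # KV-memory budget, posed as a 0/1 knapsack and solved exactly, versus FIFO admission.

    def fifo_admit(requests, budget):
        """Admit requests in arrival order until the KV budget is exhausted."""
        used = value = 0
        for mem, val in requests:
            if used + mem <= budget:
                used += mem
                value += val
        return value

    def knapsack_admit(requests, budget):
        """Exact batch selection maximizing total value subject to the KV budget."""
        dp = [0] * (budget + 1)                  # dp[b] = best value using at most b blocks
        for mem, val in requests:
            for b in range(budget, mem - 1, -1):
                dp[b] = max(dp[b], dp[b - mem] + val)
        return dp[budget]

    # Toy data: (KV blocks required, tokens expected to be served) per request.
    reqs = [(7, 3), (4, 9), (5, 9), (3, 8)]
    print(fifo_admit(reqs, budget=10))      # 11: FIFO admits the large, low-value request first
    print(knapsack_admit(reqs, budget=10))  # 18: the exact formulation picks the two 9-value requests

On this toy input, FIFO fills the budget with an early low-value request and reaches a value of 11, while the exact formulation reaches 18; a realistic formulation would of course also fold in latency targets, fairness, and uncertainty in output length.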

Why It Matters

Better LLM serving algorithms could drastically cut latency and costs for production systems at scale.