LLM-Emu: Native Runtime Emulation of LLM Inference via Profile-Driven Sampling
Test LLM serving systems without expensive GPU runs: most metrics stay within roughly 5% of real hardware
Evaluating LLM serving systems normally requires running real workloads on expensive GPUs to capture dynamic arrivals, queueing, and batching. Existing simulators either operate offline, re-implement schedulers, or depend on accurate kernel-level models. LLM-Emu, developed by Wei Da and Evangelia Kalyvianaki from the University of Cambridge, takes a different approach: it preserves vLLM’s production HTTP, scheduling, KV-cache, and output-processing paths, replacing only the GPU forward pass with profile-driven latency sampling and synthetic token generation. This keeps the emulation lightweight and native to the serving engine.
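The core idea is simple enough to sketch. Below is a minimal, self-contained Python illustration of profile-driven sampling; the names (`PROFILE`, `fake_forward`) and the bucketing scheme are our own assumptions for illustration, not LLM-Emu's actual internals.

```python
import random
import time

# Hypothetical latency profile: forward-pass times (seconds) measured
# offline on real GPUs, bucketed by phase and batched token count.
# The numbers below are illustrative placeholders, not real measurements.
PROFILE = {
    ("prefill", 128): [0.041, 0.043, 0.040],
    ("prefill", 512): [0.118, 0.121, 0.115],
    ("decode", 1):    [0.018, 0.019, 0.017],
    ("decode", 8):    [0.022, 0.023, 0.021],
}

def nearest_bucket(phase: str, num_tokens: int) -> tuple:
    """Fall back to the closest profiled bucket for shapes not profiled."""
    candidates = [key for key in PROFILE if key[0] == phase]
    return min(candidates, key=lambda key: abs(key[1] - num_tokens))

def fake_forward(phase: str, num_tokens: int, vocab_size: int = 32_000) -> int:
    """Stand-in for the GPU forward pass: block for a sampled latency,
    then emit a random token id instead of computing real logits."""
    latency = random.choice(PROFILE[nearest_bucket(phase, num_tokens)])
    time.sleep(latency)                   # profile-driven latency sampling
    return random.randrange(vocab_size)   # synthetic token generation

# Example: a 128-token prefill followed by a short decode loop.
tokens = [fake_forward("prefill", 128)]
while len(tokens) < 5:
    tokens.append(fake_forward("decode", 1))
print(tokens)
```

Because latencies are drawn from empirical measurements rather than an analytic kernel model, fidelity depends mainly on the quality of the profile, while each request still flows through the engine's real scheduler and KV-cache bookkeeping.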
Tested on two GPU types, four model variants (from two families), two attention backends (including FlashAttention), and realistic workloads (Poisson and bursty ShareGPT), LLM-Emu maintains high fidelity: time-per-output-token (TPOT) and inter-token latency (ITL) deviate by at most 4.8%, end-to-end latency by 5.3%, and output throughput by 1.9%. Time-to-first-token (TTFT) shows a larger deviation (up to 10.4% error) due to its sensitivity to admission control and queue state. The tool is open-sourced, providing a practical way to run online experiments without GPU costs.
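For readers less familiar with these acronyms, the sketch below computes the metrics from per-token completion timestamps using their common definitions; the function name and signature are illustrative, and the paper may handle edge cases differently.

```python
def request_metrics(arrival: float, token_times: list[float]) -> dict:
    """Common serving-metric definitions, computed from the wall-clock
    time each output token completed (illustrative, not from the paper)."""
    ttft = token_times[0] - arrival                     # time to first token
    e2e = token_times[-1] - arrival                     # end-to-end latency
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # inter-token gaps
    tpot = (e2e - ttft) / max(len(token_times) - 1, 1)  # avg time per output token
    return {"ttft": ttft, "e2e": e2e, "itl": itl, "tpot": tpot}

# A request arriving at t=0.0 whose tokens complete at 0.20s, 0.25s, 0.30s:
print(request_metrics(0.0, [0.20, 0.25, 0.30]))
# -> ttft 0.2s, e2e 0.3s, itl [0.05, 0.05], tpot 0.05s
```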
- LLM-Emu achieves TPOT and ITL error ≤4.8%, E2E latency error ≤5.3%, and output throughput error ≤1.9% vs. real vLLM on GPUs.
- Works across two GPU types, four model variants (from the LLaMA and Falcon families), two attention backends, and two workload patterns.
- Open-source emulator that replaces GPU forward execution with profile-sampled latency and synthetic tokens while keeping serving engine paths intact.
Why It Matters
Enables accurate online experimentation with LLM serving systems while slashing GPU costs for researchers and engineers.