FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving
A new research paper introduces FASER, delivering up to 1.92× lower latency and up to 53% higher throughput in a vLLM-based prototype.
A team of researchers including Wenyan Chen, Chengzhi Lu, Yanying Lin, and Dmitrii Ustiugov has published a paper introducing FASER (Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving). The system addresses critical limitations in current speculative decoding approaches used to accelerate large language model inference. Existing systems typically fix a single speculative token length for an entire batch and serialize the draft and verification phases, making them rigid and inefficient under dynamic online traffic. This leads to prolonged latency during low load, because the verification phase sits blocked while drafting runs, and to wasted computation on rejected tokens during high load.
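For context, the serialized pattern the paper critiques looks roughly like the toy loop below: one fixed speculative length for every request, and drafting and verification never overlap. The model calls are stand-in functions invented for illustration, not vLLM or FASER APIs.

```python
# Minimal sketch of the conventional serialized speculative decoding loop:
# a single fixed speculative length for the whole batch, and the draft and
# verification phases run strictly one after the other.
import random

FIXED_SPEC_LEN = 4  # same speculative length for every request in the batch


def draft_tokens(prompt, k):
    """Stand-in draft model: propose k candidate tokens."""
    return [random.randint(0, 9999) for _ in range(k)]


def verify_tokens(prompt, candidates):
    """Stand-in target model: accept a prefix of the candidates."""
    accepted = []
    for tok in candidates:
        if random.random() < 0.7:   # pretend 70% per-token acceptance
            accepted.append(tok)
        else:
            break                    # first rejection ends acceptance
    return accepted


def serve_batch(batch):
    """Serialized loop: draft for each request, then verify, never overlapping."""
    for prompt in batch:
        candidates = draft_tokens(prompt, FIXED_SPEC_LEN)   # draft phase
        accepted = verify_tokens(prompt, candidates)        # verify phase (blocks)
        # Tokens beyond the first rejection were drafted and verified for nothing.
        wasted = len(candidates) - len(accepted)
        print(f"{prompt!r}: accepted {len(accepted)}, wasted {wasted}")


serve_batch(["req-A", "req-B"])
```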
FASER introduces two key innovations: first, it dynamically adjusts the speculative length for each individual request within a continuous batch and performs early pruning of rejected tokens during verification to minimize computational waste. Second, it breaks the verification phase into smaller frontiers or chunks, allowing them to overlap with the draft phase through fine-grained spatial multiplexing with minimal resource interference. This overlapping execution prevents GPU underutilization and reduces idle time.
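As a rough illustration of the first idea, the sketch below adapts a per-request speculative length from a smoothed acceptance rate and prunes everything after the first rejected token. The update rule, thresholds, and names (`SpecState`, `early_prune`) are assumptions made for the example, not the policy described in the paper.

```python
# Hedged sketch: each request keeps its own speculative length, grown when its
# recent drafts are mostly accepted and shrunk when they are mostly rejected.
from dataclasses import dataclass

MIN_LEN, MAX_LEN = 1, 8


@dataclass
class SpecState:
    spec_len: int = 4          # current speculative length for this request
    ema_accept: float = 0.7    # smoothed per-token acceptance rate

    def update(self, proposed: int, accepted: int) -> None:
        """Adapt after one draft/verify round (simple EMA plus a step rule)."""
        rate = accepted / max(proposed, 1)
        self.ema_accept = 0.8 * self.ema_accept + 0.2 * rate
        if self.ema_accept > 0.8 and self.spec_len < MAX_LEN:
            self.spec_len += 1     # drafts are paying off: speculate further
        elif self.ema_accept < 0.5 and self.spec_len > MIN_LEN:
            self.spec_len -= 1     # too many rejections: speculate less


def early_prune(candidates, accept_flags):
    """Keep only the accepted prefix; everything after the first rejection is
    dropped so no further verification work is spent on doomed tokens."""
    kept = []
    for tok, ok in zip(candidates, accept_flags):
        if not ok:
            break
        kept.append(tok)
    return kept


state = SpecState()
for proposed, accepted in [(4, 4), (5, 2), (4, 1)]:
    state.update(proposed, accepted)
    print(state.spec_len, round(state.ema_accept, 2))

print(early_prune(["a", "b", "c", "d"], [True, True, False, True]))  # -> ['a', 'b']
```

In a real continuous-batching scheduler, the adaptation signal would come from the verifier's accept/reject decisions for each request rather than the hard-coded values used here.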
The researchers implemented a FASER prototype within the popular vLLM inference serving framework. Their experimental results show dramatic improvements: up to 53% higher throughput and up to 1.92× lower latency compared to state-of-the-art speculative decoding systems. These gains come from better adaptation to volatile inference traffic patterns and more efficient use of GPU resources. The paper, submitted to arXiv on April 22, 2026 (arXiv:2604.20503), represents a significant advancement in making LLM inference more responsive and cost-effective for real-world applications.
- Dynamically adjusts speculative token length per request within batches, unlike rigid batch-level settings in current systems
- Overlaps draft and verification phases by breaking verification into chunks, reducing GPU idle time and blocking (see the overlap sketch after this list)
- Prototype in vLLM shows up to 53% higher throughput and up to 1.92× lower latency versus state-of-the-art systems
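The overlap idea from the second bullet can be sketched with plain asyncio as a stand-in for the GPU-level spatial multiplexing the paper describes: verification is split into chunks that run while the next draft round is already in flight. Chunk sizes, timings, and function names here are illustrative assumptions, not the paper's implementation.

```python
# Toy illustration of overlapping chunked verification with the next draft
# round, so neither phase leaves the other waiting idle.
import asyncio


async def draft_round(round_id: int):
    await asyncio.sleep(0.05)              # pretend draft-model compute
    return [f"r{round_id}-tok{i}" for i in range(8)]


async def verify_chunk(chunk):
    await asyncio.sleep(0.02)              # pretend target-model compute
    return len(chunk)                      # pretend every token is accepted


async def serve(rounds: int, chunk_size: int = 4):
    pending = await draft_round(0)
    for r in range(1, rounds + 1):
        # Kick off the next draft round immediately...
        next_draft = asyncio.create_task(draft_round(r))
        # ...while verifying the previous round's tokens chunk by chunk.
        for i in range(0, len(pending), chunk_size):
            accepted = await verify_chunk(pending[i:i + chunk_size])
            print(f"round {r - 1}: verified chunk of {accepted}")
        pending = await next_draft          # drafting overlapped with verification
    print("done")


asyncio.run(serve(rounds=2))
```

Chunking matters because a whole-batch verification step would otherwise block until every request's draft is checked; smaller chunks give the scheduler frequent points at which drafting and verification can interleave.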
Why It Matters
Makes real-time LLM applications like chatbots and coding assistants significantly faster and more cost-efficient to serve at scale.