WISP: Waste- and Interference-Suppressed Distributed Speculative LLM Serving at the Edge via Dynamic Drafting and SLO-Aware Batching
New research tackles LLM serving bottlenecks, improving system goodput by up to 3.7x over a prior distributed serving method.
A research team from Virginia Tech and Queen's University Belfast has introduced WISP, a system designed to optimize how large language models (LLMs) are served across distributed networks, particularly at the edge. The core problem they address is that most LLM inference requests from smartphones and laptops are currently sent to centralized data centers, overloading servers while leaving powerful edge devices underutilized. WISP employs speculative decoding: smaller models on edge devices quickly generate "draft" tokens that a larger, more accurate model in the cloud then verifies, balancing the computational load between edge and cloud.
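To make the draft-and-verify loop concrete, here is a minimal, self-contained Python sketch of speculative decoding. It is an illustration, not WISP's implementation: the toy draft and target models, the roughly 70% agreement rate, and the greedy acceptance rule (accept a drafted token only if it matches the target model's own choice) are assumptions made for readability; production systems verify against the target model's probability distribution to stay lossless.

```python
import random

random.seed(0)
VOCAB = list(range(100))

def target_next_token(context):
    # Toy stand-in for the large cloud model: a deterministic function of the context.
    return hash(tuple(context)) % len(VOCAB)

def draft_next_token(context):
    # Toy stand-in for the small edge model: agrees with the target ~70% of the time.
    if random.random() < 0.7:
        return target_next_token(context)
    return random.choice(VOCAB)

def speculative_step(context, draft_len):
    """One draft-then-verify round (greedy variant).

    The edge drafts `draft_len` tokens; the server accepts the longest prefix
    that matches its own predictions and appends one corrected or bonus token,
    so the output matches what the target model alone would have produced.
    """
    # 1. Edge device drafts draft_len tokens autoregressively.
    ctx = list(context)
    drafted = []
    for _ in range(draft_len):
        tok = draft_next_token(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2. Server verifies the drafts (a single batched pass in practice).
    ctx = list(context)
    accepted = []
    for tok in drafted:
        expected = target_next_token(ctx)
        if tok != expected:
            # Everything from here on is rejected: the time the edge spent
            # drafting these tokens is the "wasted drafting time" WISP suppresses.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # All drafts accepted: the server contributes one extra token for free.
    accepted.append(target_next_token(ctx))
    return accepted

print(speculative_step(context=[1, 2, 3], draft_len=4))
```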
However, the researchers identified two major bottlenecks in this distributed approach: Wasted Drafting Time (when edge devices draft tokens the server ultimately rejects) and Verification Interference (where verification requests from many edges overwhelm the central server). WISP tackles these with three intelligent components: a speculation controller that dynamically adjusts how many tokens to draft based on predicted acceptance rates, a verification time estimator, and an SLO-aware batch scheduler that efficiently groups verification requests. The results are significant, with the system achieving up to a 2.1x improvement in capacity over purely centralized serving and a 4.1x boost over a prior distributed method called SLED, all without sacrificing the model's output quality.
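The summary does not reproduce the speculation controller's exact policy, but the trade-off it navigates can be sketched with a standard result from the speculative decoding literature: if each drafted token is accepted with probability alpha, a round of k drafts yields about (1 - alpha^(k+1)) / (1 - alpha) output tokens, so a controller can pick the k that maximizes tokens per unit time given the edge's per-token drafting cost and the server's verification cost. The function names and cost parameters below are illustrative assumptions, not WISP's API.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per draft-and-verify round when each drafted
    token is accepted independently with probability `alpha` (the standard
    speculative-decoding result: the accepted prefix plus one server token)."""
    if alpha >= 1.0:
        return k + 1
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)

def choose_draft_length(alpha, t_draft, t_verify, k_max=16):
    """Pick the draft length k that maximizes expected tokens per second.

    alpha    : predicted acceptance rate for this request (0..1)
    t_draft  : time to draft one token on the edge device (seconds)
    t_verify : time for one server verification pass (seconds)
    """
    best_k, best_rate = 1, 0.0
    for k in range(1, k_max + 1):
        rate = expected_tokens_per_round(alpha, k) / (k * t_draft + t_verify)
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k

# A request predicted to have high acceptance drafts more tokens per round;
# a low-acceptance request drafts fewer, cutting wasted edge computation.
print(choose_draft_length(alpha=0.9, t_draft=0.01, t_verify=0.05))  # longer drafts
print(choose_draft_length(alpha=0.3, t_draft=0.01, t_verify=0.05))  # shorter drafts
```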
- Improves system capacity by up to 4.1x and goodput by 3.7x compared to the SLED method for distributed serving.
- Uses dynamic drafting to suppress wasted edge computation and SLO-aware batching (sketched after this list) to manage server verification load.
- Enables lossless speculative decoding, balancing workload between edge and cloud without compromising LLM output accuracy.
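For intuition on the SLO-aware batch scheduler, the sketch below shows one simple way such a scheduler could work: a verification-time estimate that grows with the number of drafted tokens in a batch, plus a greedy admission rule that only adds a request when the estimated batch completion time still meets every admitted request's deadline. The data fields, cost model, and admission policy here are hypothetical simplifications, not the scheduler described in the paper.

```python
import time
from dataclasses import dataclass

@dataclass
class VerifyRequest:
    """A pending verification request from one edge device (hypothetical fields)."""
    request_id: str
    num_draft_tokens: int  # drafted tokens awaiting verification
    deadline: float        # absolute time by which this request must finish (its SLO)

def estimate_batch_time(batch, per_token_cost=0.002, fixed_overhead=0.01):
    """Hypothetical verification-time estimate: a fixed per-pass overhead plus a
    cost proportional to the total number of drafted tokens in the batch."""
    return fixed_overhead + per_token_cost * sum(r.num_draft_tokens for r in batch)

def form_slo_aware_batch(pending, now=None):
    """Greedy SLO-aware batching: admit requests in deadline order as long as the
    estimated completion time still meets every admitted request's deadline."""
    now = time.monotonic() if now is None else now
    batch = []
    for req in sorted(pending, key=lambda r: r.deadline):
        candidate = batch + [req]
        finish = now + estimate_batch_time(candidate)
        if all(finish <= r.deadline for r in candidate):
            batch = candidate
    return batch

# Usage: the tight-deadline requests are grouped and served first, while the
# slack request waits for a later batch instead of inflating this one.
now = 0.0
pending = [
    VerifyRequest("edge-a", num_draft_tokens=6, deadline=now + 0.05),
    VerifyRequest("edge-b", num_draft_tokens=8, deadline=now + 0.20),
    VerifyRequest("edge-c", num_draft_tokens=4, deadline=now + 0.03),
]
print([r.request_id for r in form_slo_aware_batch(pending, now=now)])  # ['edge-c', 'edge-a']
```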
Why It Matters
This could drastically reduce cloud costs and latency for AI applications, making advanced LLMs more scalable and responsive on everyday devices.