SPECTRE speeds up LLM inference by up to 2.28x using idle small models
Reusing underutilized 'tail' models as drafters speeds up large-model decoding by up to 2.28x, with only minor interference to the tail models' own workloads.
LLM serving platforms face a long-tail demand pattern: a few large models receive most requests while many smaller models sit underutilized. SPECTRE (Parallel Speculative Decoding with a Multi-Tenant Remote Drafter) turns this inefficiency into a speedup by using those idle small models as speculative drafters for the large ones. Its key idea is to run draft generation and target verification in parallel rather than one after the other. Three techniques make this practical: a hybrid ordinary-parallel decoding strategy guided by a throughput-analysis threshold, speculative priority scheduling that maintains the overlap under multi-tenant traffic, and draft-side prompt compression that cuts drafting latency.
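For readers unfamiliar with the draft-then-verify loop that SPECTRE builds on, here is a minimal sketch of standard greedy speculative decoding. It is not SPECTRE's implementation: it runs drafting and verification sequentially, whereas SPECTRE overlaps them, and `target_model`, `draft_model`, and the draft length `k` are hypothetical toy stand-ins rather than anything taken from the paper or from SGLang.

```python
from typing import Callable, List, Tuple

Token = int
Model = Callable[[List[Token]], Token]  # maps a prefix to its greedy next token


def target_model(prefix: List[Token]) -> Token:
    """Toy stand-in for the large model (an arbitrary rule, not a real LLM)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[Token]) -> Token:
    """Toy stand-in for the small drafter: usually agrees with the target."""
    guess = target_model(prefix)
    # Deliberately disagree whenever the prefix length is a multiple of 4,
    # so that the rejection path is exercised.
    return guess if len(prefix) % 4 else (guess + 1) % 50


def autoregressive(model: Model, prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(
    target: Model, draft: Model, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Greedy speculative decoding; returns the tokens and the number of verify passes."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    verify_passes = 0
    while len(seq) < goal:
        # 1) The drafter proposes k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # 2) The target checks each proposed position. A real system scores
        #    all k positions in one batched forward pass; here we loop.
        verify_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch, discard the rest
                break
            accepted.append(tok)
        else:
            # All k drafts accepted: the same pass also yields one extra token.
            accepted.append(target(seq + accepted))
        seq.extend(accepted)
    return seq[:goal], verify_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(target_model, draft_model, prompt, n_new=20)
print(tokens == autoregressive(target_model, prompt, 20), passes)
```

Because every accepted token is checked against the target's own greedy choice, the output is identical to plain autoregressive decoding with the target; the gain comes from needing fewer target passes than generated tokens. In this sequential form the target still waits while the drafter runs, which is the idle time that parallel schemes such as SPECTRE aim to hide.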
Built on SGLang, SPECTRE was tested across multiple drafter-target pairs, reasoning benchmarks, and real-world long-context workloads. For large-model deployments such as Qwen3-235B-A22B with TP=8, it delivered up to 2.28x speedup over autoregressive decoding and an additional 66% relative improvement over the strongest speculative baselines, while causing only minor interference to the tail models' native workloads. The code is open-sourced, giving cloud providers a way to raise throughput without adding GPUs.
- Repurposes underutilized small 'tail' models as drafters for large LLMs via parallel speculative decoding
- Achieves up to 2.28x speedup over autoregressive decoding on Qwen3-235B-A22B (TP=8)
- Adds a further 66% relative improvement over the strongest speculative decoding baselines, with only minor interference to tail-model workloads
Why It Matters
Lets serving clusters turn idle small-model capacity into faster large-model decoding without adding GPUs.