SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding
New benchmark shows synthetic test data can overstate real-world AI inference speedups by up to 2x.
A team of eight researchers, including Talor Abramovich and Benjamin Chislett, has released SPEED-Bench, a comprehensive new benchmark designed to solve a critical problem in AI infrastructure: accurately measuring the performance of speculative decoding (SD). Speculative decoding is a leading technique for speeding up large language model (LLM) inference: a smaller 'draft' model predicts several tokens ahead, and the larger 'target' model then verifies them in a single forward pass. However, its performance is highly workload-dependent, and existing benchmarks fail to reflect real-world conditions due to limited task diversity and a lack of throughput-oriented testing.
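The core draft-then-verify loop is easy to picture. Below is a minimal, self-contained sketch of the greedy case, where acceptance is an exact-match check (sampled decoding replaces this with a rejection-sampling test); `draft_next` and `target_next` are toy stand-ins for real models, and real systems verify all drafted tokens in one batched target pass rather than one at a time.

```python
def draft_next(seq):
    # Toy draft model: cheap, usually but not always right.
    return (seq[-1] * 3 + 1) % 50

def target_next(seq):
    # Toy target model: the ground truth we want to match.
    return (seq[-1] * 3 + 1) % 53

def speculative_step(seq, k=4):
    """Draft k tokens, then verify them against the target model.

    Greedy speculative decoding: accept draft tokens until the first
    mismatch, then substitute the target's token and stop.
    """
    # Drafting phase: roll the cheap model forward k steps.
    draft, ctx = [], list(seq)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # Verification phase: in production this is one batched target pass.
    accepted, ctx = [], list(seq)
    for tok in draft:
        expected = target_next(ctx)
        if tok == expected:
            accepted.append(tok)        # draft guessed right: "free" token
            ctx.append(tok)
        else:
            accepted.append(expected)   # first mismatch: take target's token
            break
    else:
        accepted.append(target_next(ctx))  # all k accepted: one bonus token
    return accepted

seq = [1]
for _ in range(5):
    seq += speculative_step(seq)
print(seq)
```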
SPEED-Bench addresses these gaps with two core components: a 'Qualitative' data split curated for maximum semantic diversity to test accuracy, and a 'Throughput' data split that evaluates speed gains across a realistic range of workloads, from low-batch latency scenarios to high-concurrency server loads. Crucially, it integrates directly with production-grade inference engines such as vLLM and TensorRT-LLM, letting practitioners analyze system behaviors that simpler tests mask. Using SPEED-Bench, the team reports several key findings, including that synthetic inputs can overestimate real-world throughput by up to 2x and that the optimal draft length depends on batch size.
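As a concrete, hedged example of what this integration looks like from the practitioner's side, here is how speculative decoding is typically switched on in vLLM. The model names are placeholders, and the configuration schema (`speculative_config` with a `num_speculative_tokens` key) follows recent vLLM releases but has changed across versions, so treat this as a sketch rather than a canonical recipe.

```python
from vllm import LLM, SamplingParams

# Placeholder model names; check the vLLM docs for the exact
# speculative_config schema in the version you run.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",         # target model
    speculative_config={
        "model": "meta-llama/Llama-3.2-1B-Instruct",  # draft model
        # Draft length; per the paper's finding, the best value
        # shifts with batch size, so tune it for your workload.
        "num_speculative_tokens": 4,
    },
)

outputs = llm.generate(
    ["Summarize speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```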
The release of SPEED-Bench establishes a much-needed unified standard for comparing SD algorithms and system implementations. By providing a realistic and diverse testing ground, it enables AI engineers to make better-informed decisions when deploying accelerated inference, ultimately leading to more efficient and cost-effective AI services. The benchmark and its data are publicly available, aiming to foster more reproducible and practical research in this high-stakes area of AI optimization.
- Provides two specialized data splits: a 'Qualitative' set for semantic diversity and a 'Throughput' set for realistic concurrency testing.
- Integrates with production engines (vLLM, TensorRT-LLM) to reveal system behaviors, finding that synthetic data can inflate measured throughput gains by up to 2x.
- Helps identify optimal configurations, such as batch-size-dependent draft lengths (a simple analytical sketch follows this list), and analyzes caveats in techniques like vocabulary pruning.
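The batch-size finding has a simple first-order intuition. A common analytical model from the speculative sampling literature says that with per-token acceptance rate alpha and draft length k, each verification step yields (1 - alpha^(k+1)) / (1 - alpha) tokens in expectation, so the returns on longer drafts diminish quickly, while the verification cost of each extra draft token grows with batch size. The sketch below uses an illustrative acceptance rate, not a number from the paper, to make the diminishing returns visible.

```python
# Expected tokens accepted per target verification step, under the
# standard i.i.d.-acceptance model with per-token acceptance rate alpha.
def expected_tokens(alpha: float, k: int) -> float:
    return (1 - alpha ** (k + 1)) / (1 - alpha)

alpha = 0.8  # illustrative assumption, not a measurement from the paper
for k in (1, 2, 4, 8):
    print(f"draft length {k}: {expected_tokens(alpha, k):.2f} tokens/step")
# Output climbs from ~1.8 toward the 1/(1-alpha) = 5.0 ceiling: past a
# few draft tokens, extra drafting adds verification work (which scales
# with batch size) for little additional acceptance.
```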
Why It Matters
Enables AI engineers to accurately benchmark and deploy faster, more cost-efficient LLM inference in real production environments.