SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
A new method accelerates speculative decoding and is proven to achieve the optimal expected acceptance length.
Speculative decoding (SD) is a key technique for reducing the inference latency of autoregressive language models, which otherwise generate tokens one at a time. Standard SD uses a lightweight draft model to propose candidate tokens, which a larger target model then verifies in parallel, accepting or rejecting them. Prior work has explored either using multiple drafts to increase acceptance rates or block verification to validate several tokens at once, but these approaches have been treated separately, leaving potential gains on the table.
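The verify-then-accept step of classic single-draft speculative decoding (the baseline that SpecTr-GBV generalizes) can be sketched as follows. This is the standard accept/reject rule from the original SD literature, not the paper's own algorithm; the function name and dict-based distributions are illustrative.

```python
import random

def speculative_accept(draft_probs, target_probs, draft_token, rng=None):
    """One verification step of classic single-draft speculative decoding.

    draft_probs / target_probs: dicts mapping token -> probability under
    the draft model q and target model p, respectively.
    Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the residual distribution max(p - q, 0), renormalized.
    Returns (token, was_accepted).
    """
    rng = rng or random.Random()
    q = draft_probs.get(draft_token, 0.0)
    p = target_probs.get(draft_token, 0.0)
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: sample a replacement from the normalized residual max(p - q, 0),
    # which preserves the target model's output distribution exactly.
    support = set(draft_probs) | set(target_probs)
    residual = {t: max(target_probs.get(t, 0.0) - draft_probs.get(t, 0.0), 0.0)
                for t in support}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights)[0], False
```

Because rejection resamples from the residual, the combined procedure emits tokens distributed exactly as the target model would, which is why SD trades latency without trading quality.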
SpecTr-GBV, introduced by Yijun Lin, Jinhao Sheng, Qingyue Cai, and Feng Zhou, unifies both strategies by framing the verification step as an optimal transport problem over draft and target token blocks. The authors prove that SpecTr-GBV achieves the optimal expected acceptance length attainable under i.i.d. draft generation, and that this bound improves as the number of drafts grows. Empirically, across five datasets and four baseline methods, SpecTr-GBV delivers superior speedup and significantly higher block efficiency (tokens accepted per target-model verification call) while maintaining output quality. Ablation studies further validate the impact of key hyperparameters.
- Unifies multi-draft and greedy block verification into a single optimal transport framework
- Theoretically proven to achieve optimal expected acceptance length for i.i.d. drafts
- Outperforms four baselines across five datasets with superior speedup and block efficiency
Why It Matters
Faster LLM inference without quality loss could significantly reduce costs and latency for production AI systems.