SpecTr-GBV: Multi-Draft Block Verification Accelerating Speculative Decoding
A new method accelerates speculative decoding and is proven to achieve the optimal expected acceptance length.
Speculative decoding (SD) is a key technique for reducing the inference latency of autoregressive language models, which otherwise generate tokens one at a time. Standard SD uses a lightweight draft model to propose candidate tokens, which a larger target model then verifies in parallel, accepting or rejecting them. Prior work has explored either using multiple drafts to increase acceptance rates or block verification to validate several tokens at once, but these approaches have been treated separately, leaving potential gains on the table.
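The verify-then-accept step of classic single-draft speculative decoding (the baseline that SpecTr-GBV generalizes) can be sketched as follows. This is the standard accept/reject rule from the original SD literature, not the paper's own algorithm; the function name and dict-based distributions are illustrative.

```python
import random

def speculative_accept(draft_probs, target_probs, draft_token, rng=None):
    """One verification step of classic single-draft speculative decoding.

    draft_probs / target_probs: dicts mapping token -> probability under
    the draft model q and target model p, respectively.
    Accept the drafted token with probability min(1, p/q); on rejection,
    resample from the residual distribution max(p - q, 0), renormalized.
    Returns (token, was_accepted).
    """
    rng = rng or random.Random()
    q = draft_probs.get(draft_token, 0.0)
    p = target_probs.get(draft_token, 0.0)
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # Rejected: sample a replacement from the normalized residual max(p - q, 0),
    # which preserves the target model's output distribution exactly.
    support = set(draft_probs) | set(target_probs)
    residual = {t: max(target_probs.get(t, 0.0) - draft_probs.get(t, 0.0), 0.0)
                for t in support}
    tokens, weights = zip(*residual.items())
    return rng.choices(tokens, weights=weights)[0], False
```

Because rejection resamples from the residual, the combined procedure emits tokens distributed exactly as the target model would, which is why SD trades latency without trading quality.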
SpecTr-GBV, introduced by Yijun Lin, Jinhao Sheng, Qingyue Cai, and Feng Zhou, unifies both strategies by framing the verification step as an optimal transport problem over draft and target token blocks. The authors prove that SpecTr-GBV achieves the optimal expected acceptance length attainable under i.i.d. draft generation, and that this bound improves as the number of drafts grows. Empirically, across five datasets and four baseline methods, SpecTr-GBV delivers superior speedup and significantly higher block efficiency (tokens accepted per target-model verification call) while maintaining output quality. Ablation studies further validate the impact of key hyperparameters.
- Unifies multi-draft and greedy block verification into a single optimal transport framework
- Theoretically proven to achieve optimal expected acceptance length for i.i.d. drafts
- Outperforms four baselines across five datasets with superior speedup and block efficiency
Why It Matters
Faster LLM inference without quality loss could significantly reduce costs and latency for production AI systems.