Make Every Draft Count: Hidden State based Speculative Decoding
New method reuses failed draft computations, achieving up to 3.3x speedup over standard speculative decoding.
A research team led by Yuetao Chen and Xuliang Wang has published a paper titled 'Make Every Draft Count: Hidden State based Speculative Decoding' on arXiv. The work tackles a fundamental inefficiency in speculative decoding, a popular technique for accelerating Large Language Model (LLM) inference. Speculative decoding uses a small 'draft' model to propose candidate tokens that a larger target model then verifies in parallel, but most draft tokens are rejected, and the computation spent generating them is thrown away. The researchers' key innovation is to shift the draft model's auto-regressive prediction from the token level to the hidden state level.
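To make the inefficiency concrete, here is a minimal toy sketch of vanilla greedy speculative decoding. The "models" are simple deterministic functions standing in for real LLMs (all names here are illustrative, not the paper's code); the point is that every draft token past the first mismatch is discarded, along with the compute that produced it.

```python
def draft_model(ctx):
    # Toy draft model: usually predicts (last token + 1) mod 10,
    # but deliberately errs when the last token is 3 mod 4.
    t = ctx[-1]
    return (t + 1) % 10 if t % 4 != 3 else (t + 2) % 10

def target_model(ctx):
    # Toy target model: the "correct" next token is (last token + 1) mod 10.
    return (ctx[-1] + 1) % 10

def speculative_step(ctx, k=4):
    # 1) Draft k tokens auto-regressively with the cheap model.
    drafts, cur = [], list(ctx)
    for _ in range(k):
        t = draft_model(cur)
        drafts.append(t)
        cur.append(t)
    # 2) Verify with the target model (in a real system, in one parallel pass).
    out, cur, n_ok = [], list(ctx), 0
    for t in drafts:
        tgt = target_model(cur)
        if tgt == t:
            out.append(t)
            cur.append(t)
            n_ok += 1
        else:
            out.append(tgt)  # take the target's correction and stop
            break
    else:
        out.append(target_model(cur))  # all drafts accepted: one bonus token
    wasted = len(drafts) - n_ok  # rejected drafts: compute thrown away
    return out, wasted
```

Running `speculative_step([0], k=4)` accepts three draft tokens, rejects the fourth, and discards it; the paper's goal is to recycle exactly that discarded work.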
This architectural shift lets the system preserve and reuse the semantic information in discarded drafts, because hidden states are not 'contaminated' by incorrect token predictions. The team designed a specialized draft model architecture and an efficient token information injection mechanism to construct high-quality draft token trees and to resample from verification failures. Their design eliminates drafting overhead to maximize hardware utilization. Extensive evaluations show up to a 3.3x speedup over standard speculative decoding baselines, a substantial step toward cheaper, more responsive LLM inference.
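The reuse idea can be illustrated with a conceptual toy, not the authors' implementation: the draft rolls forward in hidden-state space, tokens are decoded from states by an output head, and after a verification failure the already-computed states are kept and only re-decoded once the corrected token's information is injected. All names, dimensions, and the injection rule below are assumptions made for illustration.

```python
import math
import random

random.seed(0)
D, V = 8, 4  # toy hidden size and vocabulary size
W_step = [[random.gauss(0, 1 / math.sqrt(D)) for _ in range(D)] for _ in range(D)]
W_head = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]  # output head
E = [[random.gauss(0, 1) for _ in range(D)] for _ in range(V)]       # token embeddings

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def step(h):
    # Auto-regress in hidden-state space: h_next = tanh(W_step @ h).
    return [math.tanh(dot(row, h)) for row in W_step]

def decode(h):
    # Greedy token from a hidden state via the output head.
    logits = [dot(row, h) for row in W_head]
    return logits.index(max(logits))

def roll_states(h0, k):
    # Draft k steps at the HIDDEN STATE level; tokens come later via decode().
    states, h = [], h0
    for _ in range(k):
        h = step(h)
        states.append(h)
    return states

def resample_after_failure(states, fail_pos, corrected_token):
    # Reuse the cached states from a failed draft: inject the corrected
    # token's embedding (toy injection rule) and re-decode, instead of
    # redrafting everything from scratch.
    out = []
    for h in states[fail_pos:]:
        h_inj = [math.tanh(a + b) for a, b in zip(h, E[corrected_token])]
        out.append(decode(h_inj))
    return out
```

The design point this toy captures is that the cached hidden states stay valid as a semantic summary even when the tokens decoded from them were rejected, so resampling only needs a cheap injection pass rather than a full redraft.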
- Proposes hidden state-level auto-regressive prediction to prevent semantic contamination from incorrect tokens, enabling draft reuse.
- Achieves up to 3.3x speedup over standard speculative decoding by recycling computation from failed draft verifications.
- Introduces a specialized draft model architecture and token injection mechanism to build efficient draft token trees for resampling.
Why It Matters
Dramatically reduces the cost and latency of running large AI models, making advanced AI more accessible for real-time applications.