D^2SD uses dual diffusion drafters to speed up LLM inference
Two diffusion models collaborate to guess more tokens correctly per verification step.
Speculative decoding speeds up large language model inference by having a cheaper draft model propose several tokens ahead, then verifying them in a single forward pass of the target model. Previous diffusion-based drafters produce an entire block of tokens at once but often commit to a single sequence, so every token after the first mismatch is discarded. That leads to low acceptance rates. Simply generating more candidate sequences in parallel doesn't help much, because redundant or poorly placed branches inflate computation without increasing the number of accepted tokens.
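To make the single-sequence problem concrete, here is a minimal Python sketch of greedy verification. The function name `accept_greedy` and the list-based inputs are illustrative assumptions, not code from the D^2SD work; production systems compare logits or use rejection sampling rather than exact token matches.

```python
def accept_greedy(draft, target_preds):
    """Return the tokens kept after one verification step.

    draft:        k drafted token ids.
    target_preds: k + 1 token ids, where target_preds[i] is the target
                  model's greedy choice given the context plus draft[:i].
    """
    accepted = []
    for i, tok in enumerate(draft):
        if tok != target_preds[i]:
            break  # first mismatch: everything drafted after it is discarded
        accepted.append(tok)
    # The target's own token at the stopping position is always kept.
    accepted.append(target_preds[len(accepted)])
    return accepted


# Four tokens drafted, mismatch at the third position.
print(accept_greedy([5, 7, 9, 11], [5, 7, 8, 11, 13]))  # [5, 7, 8]
```

In this toy case the fourth drafted token is thrown away even though it was computed, which is the waste that multi-candidate drafting tries to recover.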
D^2SD (Dual Diffusion Speculative Decoding) addresses this with a two-stage pipeline. The first diffusion drafter generates a block of tokens along with per-position confidence scores. These scores are used to build a confidence-guided prefix tree that identifies the most likely rejection boundaries. A second, variable-prefix diffusion drafter then re-anchors at each selected prefix and proposes multiple alternative continuations in a single batched pass. The resulting candidates share common prefixes, so they can be verified simultaneously via cascade attention, which increases the number of tokens accepted per verification step.
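One way to picture the boundary selection is the sketch below. It is an assumption for illustration only, not the scoring rule used by D^2SD: it treats each confidence score as an independent acceptance probability and ranks positions by how likely they are to be the first rejection. The names `rejection_scores` and `top_k_prefix_lengths` are hypothetical.

```python
def rejection_scores(conf):
    """Score each position as the likely first rejection point.

    ASSUMPTION: conf[j] is read as an independent probability that
    token j is accepted, so
        score[j] = P(tokens 0..j-1 accepted) * P(token j rejected).
    """
    scores = []
    survive = 1.0  # probability that every earlier token was accepted
    for c in conf:
        scores.append(survive * (1.0 - c))
        survive *= c
    return scores


def top_k_prefix_lengths(conf, k):
    """Pick the k prefix lengths where a rejection is most likely.

    A prefix length j keeps tokens 0..j-1 and asks a second drafter
    to propose new tokens starting at position j.
    """
    scores = rejection_scores(conf)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


conf = [0.9, 0.5, 0.8, 0.3]
print(top_k_prefix_lengths(conf, 2))  # [1, 3]
```

Here the low-confidence tokens at positions 1 and 3 are chosen as re-anchoring points, so alternative continuations are spent where the first draft is most likely to fail.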
According to the reported results, D^2SD outperforms both the underlying diffusion approach and strong autoregressive speculative decoding baselines. By recovering tokens that would otherwise be discarded, the method achieves higher throughput without sacrificing model quality, since every accepted token is still verified by the target model. The work highlights how structured candidate diversity can make speculative decoding more efficient, especially for latency-sensitive applications.
- First diffusion drafter generates token block with confidence scores to identify rejection boundaries.
- Second variable-prefix drafter re-anchors at top-K prefixes and proposes diverse continuations in one batch.
- Cascade attention verifies all shared-prefix candidates jointly, increasing acceptance rate without extra verification passes.
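The benefit of shared prefixes can be sketched with a small token trie. This is a toy model, not an implementation of cascade attention: it only counts how many distinct tokens the verifier has to process and finds the longest accepted path. The helper names and the `target_next` callback are hypothetical, and the callback is invoked step by step here for clarity, whereas a real verifier would score every tree node in one forward pass.

```python
def build_trie(candidates):
    """Merge candidate token sequences into a nested-dict prefix tree."""
    root = {}
    for seq in candidates:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root


def count_nodes(trie):
    """Number of distinct tokens the verifier must process."""
    return sum(1 + count_nodes(child) for child in trie.values())


def longest_accepted(trie, target_next, context):
    """Follow the target's greedy choices through the tree."""
    accepted = []
    node = trie
    while True:
        tok = target_next(context + accepted)
        if tok not in node:
            break  # no candidate continues with the target's token
        accepted.append(tok)
        node = node[tok]
    return accepted


candidates = [[1, 2, 3, 4], [1, 2, 5, 6], [1, 7, 8, 9]]
trie = build_trie(candidates)
print(sum(len(c) for c in candidates), count_nodes(trie))  # 12 9

# Toy target model that always continues the sequence 1, 2, 5, 6, 0.
truth = [1, 2, 5, 6, 0]
target_next = lambda ctx: truth[len(ctx)]
print(longest_accepted(trie, target_next, []))  # [1, 2, 5, 6]
```

With only the first candidate, the same toy target would accept two drafted tokens; the second branch, re-anchored after the shared prefix `[1, 2]`, raises that to four while adding only two extra tree nodes.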
Why It Matters
Faster LLM inference means lower API costs and lower latency for real-time applications.