D^2SD uses dual diffusion drafters to speed up LLM inference

arXiv cs.DC June 04, 2026

⚡Two diffusion models collaborate to guess more tokens correctly per verification step.

Deep Dive

Speculative decoding speeds up large language model inference by having a cheap draft model propose several tokens ahead, then verifying them in a single forward pass of the target model. Diffusion-based drafters produce an entire block of tokens at once, but earlier ones commit to a single sequence, so every token after the first mismatch is discarded. That leads to low acceptance rates. Simply generating more candidate sequences in parallel doesn't help much, because redundant or poorly placed branches inflate computation without increasing the number of accepted tokens.
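To make the "first mismatch" problem concrete, here is a minimal sketch of greedy draft verification. It is an illustration under stated assumptions, not the paper's implementation: `accept_draft` and `toy_target` are hypothetical names, the target model is replaced by a toy next-token function, and the loop stands in for what a real system computes in one batched forward pass.

```python
from typing import Callable, List, Sequence


def toy_target(seq: Sequence[int]) -> int:
    """Hypothetical stand-in for the target model: next token is last + 1 (mod 10)."""
    return (seq[-1] + 1) % 10


def accept_draft(
    context: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Greedy verification of a single draft block.

    Draft tokens are kept until the first one that disagrees with the target.
    At that point the target's own token is emitted and the rest of the draft
    is thrown away. If the whole draft matches, the target contributes one
    extra token.
    """
    accepted: List[int] = []
    for tok in draft:
        expected = target_next(list(context) + accepted)
        if tok != expected:
            accepted.append(expected)  # correction token from the target
            return accepted            # everything after the mismatch is lost
        accepted.append(tok)
    accepted.append(target_next(list(context) + accepted))  # bonus token
    return accepted


# The draft goes wrong at its third position, so its fourth token is wasted
# even though it might have been fine after a corrected third token.
result = accept_draft([1], [2, 3, 9, 5], toy_target)
print(result)
```

With a single committed sequence, one bad position caps the gain for the whole block, which is the inefficiency that branching drafters try to recover.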

D^2SD (Dual Diffusion Speculative Decoding) addresses this with a two-stage pipeline. The first diffusion drafter generates a block along with per-position confidence scores. These scores are used to build a confidence-guided prefix tree that identifies the most likely rejection boundaries. The second, variable-prefix diffusion drafter then re-anchors at each selected prefix and proposes multiple alternative continuations in a single batched pass. Because the resulting candidates share common prefixes, they can be verified together via cascade attention, which increases the number of tokens accepted per verification step.
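One way to picture the confidence-guided step is sketched below. The scoring rule is an assumption made for illustration only: it treats each confidence score as an independent acceptance probability and ranks positions by how likely they are to be the first rejection. The summary does not state the paper's actual rule, and `rejection_boundary_probs` and `top_k_prefixes` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def rejection_boundary_probs(conf: Sequence[float]) -> List[float]:
    """Probability that position i is the FIRST rejected token.

    Assumption (not from the paper): conf[i] is the chance that token i is
    accepted, independently of the other positions.
    """
    probs: List[float] = []
    survive = 1.0  # probability that all earlier tokens were accepted
    for c in conf:
        probs.append(survive * (1.0 - c))
        survive *= c
    return probs


def top_k_prefixes(
    draft: Sequence[str], conf: Sequence[float], k: int
) -> List[Tuple[int, List[str]]]:
    """Pick the k most likely rejection boundaries.

    Returns (boundary index, prefix to re-anchor at) pairs in position order.
    The prefix is the part of the draft that precedes the suspect token.
    """
    probs = rejection_boundary_probs(conf)
    ranked = sorted(range(len(draft)), key=lambda i: probs[i], reverse=True)[:k]
    return [(i, list(draft[:i])) for i in sorted(ranked)]


draft = ["the", "cat", "sat", "on", "mat"]
conf = [0.9, 0.8, 0.4, 0.9, 0.7]
anchors = top_k_prefixes(draft, conf, k=2)
for boundary, prefix in anchors:
    print(boundary, prefix)
```

In this toy example the low-confidence third token makes position 2 the most likely boundary, so a second drafter would be asked for alternative continuations after "the cat" and, as a fallback, after "the".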

According to the paper's empirical results, D^2SD outperforms both the diffusion drafting approach it builds on and strong autoregressive speculative decoding baselines. By recovering tokens that would otherwise be discarded, the method achieves higher throughput without sacrificing model quality, since every accepted token is still verified by the target model. The work highlights how structured candidate diversity can make speculative decoding more efficient, especially for latency-sensitive applications.

Key Points

The first diffusion drafter generates a token block with confidence scores to identify rejection boundaries.
The second, variable-prefix drafter re-anchors at the top-K prefixes and proposes diverse continuations in one batch.
Cascade attention verifies all shared-prefix candidates jointly, increasing the acceptance rate without an extra verification pass.
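The benefit of shared prefixes can be sketched with a small token trie. This is not cascade attention itself, only an illustration of the structure it exploits: candidates that share a prefix need that prefix scored once, and the verifier can follow whichever branch agrees with the target. All names here (`build_trie`, `count_nodes`, `longest_accepted`, `toy_target`) are hypothetical, and the target model is a toy function.

```python
from typing import Callable, Dict, List, Sequence

Trie = Dict[int, "Trie"]  # token -> subtree of continuations


def toy_target(seq: Sequence[int]) -> int:
    """Hypothetical stand-in for the target model: next token is last + 1 (mod 10)."""
    return (seq[-1] + 1) % 10


def build_trie(candidates: Sequence[Sequence[int]]) -> Trie:
    """Merge candidate sequences so that shared prefixes are stored once."""
    root: Trie = {}
    for cand in candidates:
        node = root
        for tok in cand:
            node = node.setdefault(tok, {})
    return root


def count_nodes(trie: Trie) -> int:
    """Number of distinct token positions that need a target score."""
    return sum(1 + count_nodes(child) for child in trie.values())


def longest_accepted(
    trie: Trie, context: Sequence[int], target_next: Callable[[Sequence[int]], int]
) -> List[int]:
    """Follow the branch that matches the target's greedy choice at each step."""
    accepted: List[int] = []
    node = trie
    while True:
        expected = target_next(list(context) + accepted)
        if expected not in node:  # no candidate continues correctly, or leaf reached
            return accepted
        accepted.append(expected)
        node = node[expected]


candidates = [[2, 3, 9, 5], [2, 3, 4, 5], [2, 3, 4, 6], [2, 7, 8, 9]]
trie = build_trie(candidates)
flat_tokens = sum(len(c) for c in candidates)
print(flat_tokens, count_nodes(trie))
print(longest_accepted(trie, [1], toy_target))
```

Here four candidates of four tokens each collapse to ten distinct positions, and the branch re-anchored after the shared prefix lets the verifier accept four tokens where the first candidate alone would have stopped after two.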

Why It Matters

Faster LLM inference means lower API costs and lower latency for real-time applications.
