Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch [P]
Implementing six speculative decoding methods from scratch to reveal real speedups.
A new open-source repository by developer shreyansh26 offers a hands-on, educational implementation of six speculative decoding methods, built entirely from scratch. The repo covers EAGLE-3, Medusa-1, standard draft-model speculation, PARD (parallel draft models), n-gram prompt lookup, and suffix decoding, all behind a shared decoding and evaluation contract. The goal is to make the algorithmic and systems-level tradeoffs of speculative decoding explicit, particularly how proposer quality interacts with verifier cost. The target model is Qwen/Qwen2.5-7B-Instruct; the learned methods train small speculative heads or draft models as proposers, while the training-free methods derive proposals from the prompt or previously generated context.
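All six methods plug into the same draft-then-verify loop. As a rough illustration of that loop (not the repo's actual contract), here is a minimal greedy-verification sketch assuming Hugging Face-style causal LMs that expose a `.logits` output; the rejection-sampling acceptance rule used for sampled decoding is omitted:

```python
import torch

@torch.no_grad()
def speculate_step(target_model, draft_model, input_ids, k=4):
    """One draft-then-verify step with greedy verification (batch size 1).

    Assumes Hugging Face-style causal LMs returning an object with `.logits`.
    """
    prompt_len = input_ids.shape[1]

    # 1. Propose: the cheap draft model generates k tokens autoregressively.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        draft_ids = torch.cat([draft_ids, logits.argmax(-1, keepdim=True)], dim=-1)

    # 2. Verify: ONE target forward pass scores all k drafted tokens at once.
    target_logits = target_model(draft_ids).logits
    target_pred = target_logits[:, prompt_len - 1 : -1, :].argmax(-1)
    drafted = draft_ids[:, prompt_len:]

    # 3. Accept the longest prefix where draft and target agree.
    agree = (target_pred == drafted)[0].long()
    n_accepted = int(agree.cumprod(0).sum())

    # 4. Keep accepted tokens plus one "bonus" token from the target itself,
    #    so even total rejection still yields one new token per step.
    bonus = target_logits[:, prompt_len - 1 + n_accepted, :].argmax(-1, keepdim=True)
    return torch.cat([input_ids, drafted[:, :n_accepted], bonus], dim=-1), n_accepted
```

The systems point is visible even in this sketch: the target scores all k drafted tokens in a single forward pass, so each step emits between 1 and k+1 tokens for roughly one target-model latency plus the draft cost.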
The project addresses common misconceptions: why a high acceptance rate doesn't guarantee higher throughput, how PARD can outperform autoregressive draft models despite lower acceptance, and how EAGLE/Medusa-style learned heads differ from traditional draft-model speculation. It also explores how simple training-free methods like n-gram lookup and suffix decoding perform when prompts contain reusable structure. The repo ships benchmark summaries, run commands, and checkpoints, though given compute constraints the numbers should be read as implementation benchmarks rather than tuned, state-of-the-art results. It's aimed at developers and researchers who want to understand speculative decoding at the algorithm-systems boundary, from training proposers to caching and verification.
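The n-gram prompt-lookup proposer is simple enough to show in full. A minimal sketch (the function name and defaults here are ours, not the repo's): match the last few tokens of the context against earlier occurrences and propose whatever followed the match.

```python
def ngram_propose(token_ids: list[int], ngram_size: int = 3, k: int = 8) -> list[int]:
    """Training-free proposer: if the last `ngram_size` tokens already appeared
    earlier in the context, propose (up to) the k tokens that followed them."""
    if len(token_ids) <= ngram_size:
        return []
    pattern = token_ids[-ngram_size:]
    # Search backwards so the most recent occurrence wins; the range excludes
    # the trailing n-gram itself to avoid a trivial self-match.
    for start in range(len(token_ids) - ngram_size - 1, -1, -1):
        if token_ids[start : start + ngram_size] == pattern:
            return token_ids[start + ngram_size : start + ngram_size + k]
    return []  # no reusable structure found; caller falls back to plain decoding

# The trailing [1, 2, 3] matches the earlier occurrence at index 3,
# so the tokens that followed it, [9, 9, 1, 2, 3], are proposed:
print(ngram_propose([5, 6, 7, 1, 2, 3, 9, 9, 1, 2, 3]))
```

This is why such methods shine on prompts with reusable structure (code editing, extraction, summarization that quotes the source) and propose nothing on novel text.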
- Implements 6 speculative decoding methods from scratch: EAGLE-3, Medusa-1, draft models, PARD, n-gram, and suffix decoding
- Uses Qwen2.5-7B-Instruct as target model with small learned heads or draft models for proposers
- Includes training and inference paths, with benchmarks clarifying why a high acceptance rate doesn't always mean higher throughput (see the back-of-envelope model after this list)
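Why doesn't a high acceptance rate guarantee higher throughput? Because tokens emitted per step have to be weighed against what each step costs, drafting time included. A back-of-envelope model with made-up illustrative timings (not the repo's measurements):

```python
def tokens_per_second(accept_rate, k, t_draft, t_target):
    # Expected accepted drafts under a per-token acceptance probability:
    # the whole prefix of length i survives with probability accept_rate**i.
    expected_accepted = sum(accept_rate**i for i in range(1, k + 1))
    # Each step emits the accepted prefix plus one target ("bonus") token,
    # and costs the drafting time plus one target forward pass.
    return (expected_accepted + 1) / (t_draft + t_target)

T = 0.030  # assumed target forward-pass latency (30 ms)
baseline = 1 / T                                                    # plain autoregressive: ~33 tok/s
slow_high = tokens_per_second(0.90, 4, t_draft=0.032, t_target=T)   # AR draft, 4 x 8 ms: ~66 tok/s
fast_low = tokens_per_second(0.75, 4, t_draft=0.004, t_target=T)    # parallel draft, one 4 ms pass: ~90 tok/s
print(f"{baseline:.0f} vs {slow_high:.0f} vs {fast_low:.0f} tokens/sec")
```

Under these assumed numbers, the 75%-acceptance parallel draft beats the 90%-acceptance autoregressive draft (~90 vs ~66 tokens/sec, against a ~33 tokens/sec autoregressive baseline) purely because its drafting step is cheaper, which is the PARD tradeoff the repo highlights.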
Why It Matters
Demystifies speculative decoding for practitioners, enabling faster LLM inference through transparent, reproducible implementations.