JetSpec from Hao AI Lab achieves 9.64x LLM inference speedup
New speculative decoding method reaches roughly 1,000 tokens per second on a single NVIDIA B200 GPU.
JetSpec, developed by Hao AI Lab, is a speculative decoding method designed to avoid the usual trade-off between draft quality and drafting cost. Prior methods faced a dilemma: autoregressive draft heads preserve causality but become expensive as tree depth increases, while block-diffusion heads draft cheaply in one pass but produce mutually inconsistent branches. JetSpec addresses this with causal parallel tree drafting, which generates a causality-preserving draft tree in a single forward pass and so combines high draft quality with low drafting cost.
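To make the role of the draft tree concrete, here is a minimal sketch of greedy tree verification, the generic step in which a target model accepts the longest drafted path that matches its own predictions. It is not JetSpec's implementation: `verify_tree`, `toy_target`, and the nested-dict tree layout are hypothetical names chosen for illustration, and a real system would score every tree node in one batched forward pass rather than calling the target model token by token.

```python
from typing import Callable, Dict, List

# Hypothetical draft tree: each key is a drafted token, each value is the
# subtree of continuations drafted after that token.
DraftTree = Dict[int, "DraftTree"]


def verify_tree(
    prefix: List[int],
    tree: DraftTree,
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens accepted in one verification step (greedy decoding).

    The target model's own choice is always kept, so the result is identical
    to plain greedy decoding; the tree only decides how many tokens we get
    per step.
    """
    accepted: List[int] = []
    node = tree
    while True:
        token = target_next(prefix + accepted)  # target's greedy next token
        accepted.append(token)
        if token not in node:
            # The draft did not anticipate this token, so the step ends here.
            return accepted
        node = node[token]  # the draft matched: descend into that branch


def toy_target(context: List[int]) -> int:
    """Stand-in for a target model: always predicts last token + 1 (mod 10)."""
    return (context[-1] + 1) % 10


# Two candidate first tokens (4 and 2); the branch under 4 goes deeper.
draft = {4: {5: {9: {}}, 7: {}}, 2: {}}
result = verify_tree([3], draft, toy_target)
print(result)  # prints [4, 5, 6]
```

In this toy run the draft correctly anticipates tokens 4 and 5, so one verification step yields three tokens (two drafted plus the target's own next token) instead of one. A draft tree with more correct branches raises that count, which is why draft quality matters as much as drafting cost.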
On the MATH-500 benchmark, JetSpec achieves up to 9.64x end-to-end speedup without any loss in output quality, and 4.58x speedup on open-ended chat tasks. When combined with CUDA graph and kernel-level optimizations, the system reaches roughly 1,000 tokens per second on a single NVIDIA B200 GPU. This makes JetSpec one of the fastest lossless inference accelerators available, particularly valuable for latency-sensitive applications like real-time chatbots, code assistants, and interactive agents. The project is open source, with code, demos, and detailed blog posts available online.
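For a sense of scale, the reported figures can be converted into per-token latency and response time with simple arithmetic. The sketch below assumes a hypothetical 10-second baseline response, which is not a number from the JetSpec results, and the function names are illustrative only.

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Average time spent on each generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Wall-clock time after applying an end-to-end speedup factor."""
    return baseline_seconds / speedup


# Roughly 1,000 tokens per second works out to about 1 ms per token.
print(ms_per_token(1000.0))  # prints 1.0

# Hypothetical 10-second baseline response (an assumption for illustration).
print(round(accelerated_seconds(10.0, 9.64), 2))  # prints 1.04
print(round(accelerated_seconds(10.0, 4.58), 2))  # prints 2.18
```

Because the 9.64x figure is described as an upper bound on MATH-500, the first of these response times is a best case, not a typical one.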
- Up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat, with zero quality loss.
- Uses causal parallel tree drafting to preserve coherence while generating the draft tree in a single pass.
- Reaches roughly 1,000 tokens per second on a single NVIDIA B200 GPU when combined with CUDA graph and kernel-level optimizations.
Why It Matters
JetSpec slashes LLM inference latency, making real-time AI applications on single GPUs more feasible and cost-effective.