Open Source

JetSpec from Hao AI Lab achieves 9.64x LLM inference speedup

r/LocalLLaMA June 26, 2026

⚡ New speculative decoding method hits roughly 1000 tokens per second on a single B200 GPU.

Deep Dive

JetSpec, developed by Hao AI Lab, is a speculative decoding method that aims to break the usual trade-off between drafting speed and draft quality. Prior methods faced a dilemma: autoregressive draft heads maintain causality but become expensive as tree depth increases, while block-diffusion heads draft cheaply in one pass but produce mutually inconsistent branches. JetSpec addresses this by generating a causality-preserving draft tree in a single forward pass, a technique called causal parallel tree drafting, which gives both high drafting quality and low computational cost.
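To make the idea of verifying a draft tree concrete, here is a minimal Python sketch of generic tree-based speculative decoding with greedy verification. It is not JetSpec's implementation: the toy target model `toy_target`, the parent-array tree encoding, and the helper names `ancestor_mask` and `verify_greedy` are all assumptions made for illustration. The mask lets each drafted token attend only to itself and its ancestors, which is what keeps each branch causally consistent. The verifier then follows the branch the target model agrees with, so the output matches plain greedy decoding.

```python
from typing import Callable, List, Sequence


def ancestor_mask(parents: Sequence[int]) -> List[List[bool]]:
    """mask[i][j] is True when draft node j is node i itself or one of its ancestors.

    parents[i] is the index of node i's parent, or -1 when the parent is the
    last committed token (the root of the draft tree).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:  # walk up to the root, marking every ancestor
            mask[i][j] = True
            j = parents[j]
    return mask


def verify_greedy(
    prefix: Sequence[int],
    tokens: Sequence[int],
    parents: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Return the tokens emitted in one verification step.

    Every emitted token is the target model's own greedy choice, so the result
    is identical to decoding with the target alone. The draft tree only decides
    how many tokens can be emitted in this step.
    """
    context = list(prefix)
    emitted: List[int] = []
    node = -1  # start at the root of the draft tree
    while True:
        want = target_next(context)  # the target's greedy token at this position
        emitted.append(want)
        context.append(want)
        # Continue only if some child of the current node drafted that token.
        match = next(
            (i for i, p in enumerate(parents) if p == node and tokens[i] == want),
            None,
        )
        if match is None:
            return emitted
        node = match


def toy_target(context: List[int]) -> int:
    """Hypothetical stand-in for the target model: counts upward modulo 10."""
    return (context[-1] + 1) % 10


# Hypothetical draft tree with two branches under the root.
TOKENS = [4, 7, 5, 9, 6, 8]
PARENTS = [-1, -1, 0, 0, 2, 1]

mask = ancestor_mask(PARENTS)
accepted = verify_greedy([3], TOKENS, PARENTS, toy_target)
print(accepted)  # [4, 5, 6, 7]
```

In this toy run three drafted tokens (4, 5, 6) are accepted and the target contributes one more (7), so four tokens are produced in a single verification step. A real system would score all tree nodes in one target forward pass under the ancestor mask; the sketch calls the toy target once per position only to keep the logic readable.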

On the MATH-500 benchmark, JetSpec achieves up to 9.64x end-to-end speedup with no loss in output quality, and 4.58x on open-ended chat tasks. Combined with CUDA graph and kernel-level optimizations, the system reaches roughly 1000 tokens per second on a single NVIDIA B200 GPU. This makes JetSpec one of the fastest lossless inference accelerators available, particularly valuable for latency-sensitive applications like real-time chatbots, code assistants, and interactive agents. The project is open source, with code, demos, and detailed blog posts available online.
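For a rough sense of what these figures mean in wall-clock terms, the arithmetic can be sketched as below. The 10-second baseline is a hypothetical number chosen for illustration, not a measurement from the project, and the helper names are assumptions. The throughput figure and the speedup figures are reported separately, so the sketch treats them independently.

```python
def per_token_latency_ms(tokens_per_second: float) -> float:
    """Average time per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def accelerated_seconds(baseline_seconds: float, speedup: float) -> float:
    """Wall-clock time after applying an end-to-end speedup factor."""
    return baseline_seconds / speedup


# Reported throughput: roughly 1000 tokens per second.
print(per_token_latency_ms(1000.0))  # 1.0 ms per token

# Hypothetical 10-second baseline generation (assumption, for illustration).
baseline = 10.0
print(round(accelerated_seconds(baseline, 9.64), 2))  # MATH-500 speedup
print(round(accelerated_seconds(baseline, 4.58), 2))  # open-ended chat speedup
```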

Key Points

Up to 9.64x end-to-end speedup on MATH-500 and 4.58x on open-ended chat, with zero quality loss.
Uses causal parallel tree drafting to preserve coherence while generating the draft tree in a single pass.
Achieves ~1000 tokens per second on a single B200 GPU through CUDA graph and kernel optimizations.

Why It Matters

JetSpec slashes LLM inference latency, making real-time AI applications on single GPUs more feasible and cost-effective.
