JetFlow: 9.64x faster LLM inference with parallel tree drafting
JetFlow shatters speculative decoding's scaling limit, hitting up to 9.64x speedup on H100 GPUs.
Speculative decoding (SD) accelerates autoregressive LLMs by drafting multiple tokens and verifying them in parallel, but scaling the draft budget has historically hit a ceiling: more drafts only help if acceptance stays high and drafting overhead stays low. Existing head-based methods face a causality-efficiency dilemma. Autoregressive drafters produce well-conditioned candidates but grow costly with tree depth, while bidirectional block-diffusion drafters generate all positions in one pass but produce inconsistent branches that waste budget and shorten the accepted length.
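The draft-and-verify step described above can be sketched in a few lines. This is a minimal illustration with greedy acceptance, not JetFlow's implementation: `verify_draft`, `toy_target`, and the token values are hypothetical, and a real system would score all draft positions in one batched target forward pass rather than calling the model once per token.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the target model: maps a context to its greedy next token.
TargetFn = Callable[[Sequence[int]], int]


def verify_draft(prefix: Sequence[int], draft: Sequence[int], target_next: TargetFn) -> List[int]:
    """Return the tokens emitted by one greedy draft-and-verify step.

    Draft tokens are accepted while they match the target's own greedy choice.
    At the first mismatch the target's token is emitted instead, and if every
    draft token matches, the target contributes one extra token at the end.
    """
    context = list(prefix)
    emitted: List[int] = []
    for token in draft:
        expected = target_next(context)
        if token != expected:
            emitted.append(expected)  # correction token from the target
            return emitted
        emitted.append(token)
        context.append(token)
    emitted.append(target_next(context))  # extra token after a fully accepted draft
    return emitted


def toy_target(context: Sequence[int]) -> int:
    """Toy target model (assumption for illustration): counts upward modulo 10."""
    return (context[-1] + 1) % 10


# The third draft token (9) disagrees with the target, so two draft tokens
# are accepted and the target supplies the correction.
print(verify_draft([1, 2], [3, 4, 9, 6], toy_target))  # -> [3, 4, 5]
```

The output always matches what the target would have produced on its own, which is why a longer accepted prefix translates directly into fewer target passes per generated token.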
JetFlow, developed by a team spanning UCSD, ByteDance, and MSRA, solves this by training a single causal parallel draft head over the frozen target model's fused hidden states. This head generates candidate trees whose scores align with the target model's autoregressive factorization, converting larger draft budgets into longer accepted prefixes. Benchmarks on dense and MoE Qwen3 models show JetFlow achieving up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads using H100 GPUs, with additional latency improvements demonstrated through vLLM integration under realistic serving loads. Code and models are open-sourced.
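To make the idea of a candidate tree and its accepted prefix concrete, here is a small sketch of greedy tree verification. It is an assumption-laden toy, not the JetFlow head or its scoring rule: the nested-dict tree layout, `accept_from_tree`, and `toy_target` are hypothetical names chosen for illustration, and production systems verify the whole tree in one pass with a tree attention mask.

```python
from typing import Callable, Dict, List, Sequence

# A candidate tree as nested dicts: each key is a draft token and each value
# is the subtree of continuations drafted after that token.
Tree = Dict[int, "Tree"]
TargetFn = Callable[[Sequence[int]], int]


def accept_from_tree(prefix: Sequence[int], tree: Tree, target_next: TargetFn) -> List[int]:
    """Walk the tree along the target's greedy choices and return the accepted path."""
    context = list(prefix)
    accepted: List[int] = []
    node = tree
    while node:
        expected = target_next(context)
        if expected not in node:
            break  # no drafted branch matches the target at this depth
        accepted.append(expected)
        context.append(expected)
        node = node[expected]
    return accepted


def toy_target(context: Sequence[int]) -> int:
    """Toy target model (assumption for illustration): counts upward modulo 10."""
    return (context[-1] + 1) % 10


# Two branches at the root; only the branch starting with 2 agrees with the target.
candidates: Tree = {2: {3: {}, 7: {}}, 5: {6: {}}}
print(accept_from_tree([1], candidates, toy_target))  # -> [2, 3]
```

Under greedy verification only one root-to-leaf path can be accepted, so extra branches pay off only when they are consistent with what the target would generate. That is the sense in which a larger draft budget needs well-conditioned branches to become a longer accepted prefix.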
- JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational benchmarks on H100 GPUs.
- Combines one-forward-pass drafting efficiency with branch-wise causal conditioning, avoiding the causality-efficiency dilemma of prior head-based methods.
- Consistently outperforms bidirectional-head and tree-based speculative decoding baselines on both dense and MoE Qwen3 models.
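The trade-off between drafting cost and accepted length in the points above can be illustrated with a back-of-the-envelope cost model. This is a simplified sketch, not the paper's analysis: it assumes plain decoding emits one token per target forward pass, verification costs one target pass, and draft cost is expressed as a fraction of a target pass. The function names and all numbers are hypothetical.

```python
def estimated_speedup(tokens_per_step: float, draft_cost: float) -> float:
    """Tokens emitted per step divided by step cost, in units of one target pass."""
    if tokens_per_step <= 0 or draft_cost < 0:
        raise ValueError("tokens_per_step must be positive and draft_cost non-negative")
    return tokens_per_step / (1.0 + draft_cost)


def autoregressive_draft_cost(depth: int, cost_per_pass: float) -> float:
    """An autoregressive drafter runs one drafter pass per tree level."""
    return depth * cost_per_pass


def parallel_draft_cost(depth: int, cost_per_pass: float) -> float:
    """A one-pass drafter pays a single drafter pass regardless of depth."""
    return cost_per_pass


# Illustrative numbers only: 6 tokens emitted per step, depth-8 tree,
# and a drafter pass costing 5% of a target pass.
slow = estimated_speedup(6.0, autoregressive_draft_cost(8, 0.05))
fast = estimated_speedup(6.0, parallel_draft_cost(8, 0.05))
print(round(slow, 2), round(fast, 2))  # -> 4.29 5.71
```

In this toy model the one-pass drafter only wins if it keeps the same accepted length, which is the condition that branch-wise causal conditioning is meant to preserve.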
Why It Matters
With reported speedups of up to 9.64x on math and 4.58x on chat workloads, JetFlow could substantially lower LLM serving cost and latency at scale.