JetFlow: 9.64x faster LLM inference with parallel tree drafting

arXiv cs.CL June 18, 2026

⚡JetFlow shatters speculative decoding's scaling limit, hitting up to 9.64x speedup on H100 GPUs.

Deep Dive

Speculative decoding (SD) accelerates autoregressive LLMs by drafting multiple tokens and verifying them in parallel, but scaling the draft budget has historically hit a ceiling: more drafts only help if acceptance stays high and drafting overhead stays low. Existing head-based methods face a causality-efficiency dilemma. Autoregressive drafters produce well-conditioned candidates but grow costly with tree depth, while bidirectional block-diffusion drafters generate all positions in one pass but produce inconsistent branches that waste budget and reduce acceptance length.
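To make the draft-then-verify idea concrete, here is a minimal sketch of one greedy speculative decoding step. It is a generic illustration, not JetFlow's implementation: `target_next`, `accepted_prefix_len`, and `speculative_step` are hypothetical names, and the toy "target model" is a simple function. A real system would score all draft positions in a single parallel forward pass rather than the loop used here for clarity.

```python
from typing import Callable, List, Sequence


def accepted_prefix_len(draft: Sequence[int], target_choice: Sequence[int]) -> int:
    """Count the leading draft tokens that match the target's greedy choice."""
    n = 0
    for d, t in zip(draft, target_choice):
        if d != t:
            break
        n += 1
    return n


def speculative_step(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[List[int]], int],
) -> List[int]:
    """Verify a drafted chain against a (toy) target model and return the new tokens.

    target_next(context) is an assumed stand-in for the target model's greedy
    next-token choice. It is evaluated at every draft position, which a real
    system would do in one batched forward pass.
    """
    target_choice = [
        target_next(list(prefix) + list(draft[:i])) for i in range(len(draft) + 1)
    ]
    k = accepted_prefix_len(draft, target_choice)
    # Keep the accepted draft tokens, then append the target's own token:
    # a correction at the first mismatch, or a bonus token if all drafts matched.
    return list(draft[:k]) + [target_choice[k]]


# Toy target: always predicts (last token + 1) mod 10.
toy_target = lambda ctx: (ctx[-1] + 1) % 10

partial = speculative_step([1, 2], [3, 4, 9, 6], toy_target)  # third draft token is wrong
full = speculative_step([1, 2], [3, 4, 5], toy_target)        # every draft token is right
```

In this toy run the first call accepts two draft tokens and substitutes the target's correction for the third, while the second call accepts all three and gains one extra token. A larger draft budget only pays off when the accepted prefix grows with it, which is the ceiling described above.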

JetFlow, developed by a team spanning UCSD, ByteDance, and MSRA, solves this by training a single causal parallel draft head over the frozen target model's fused hidden states. This head generates candidate trees whose scores align with the target model's autoregressive factorization, converting larger draft budgets into longer accepted prefixes. Benchmarks on dense and MoE Qwen3 models show JetFlow achieving up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational workloads using H100 GPUs, with additional latency improvements demonstrated through vLLM integration under realistic serving loads. Code and models are open-sourced.
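One way to picture how a draft budget is spent on a candidate tree is a best-first expansion that ranks each branch by its cumulative log-probability, meaning the sum of per-token conditional log-probabilities along the path, which mirrors an autoregressive factorization. The sketch below is an assumption-laden illustration rather than JetFlow's published algorithm: `build_draft_tree`, `cond_logprobs`, `top_k`, and the toy distribution are all hypothetical.

```python
import heapq
import math
from typing import Callable, Dict, List, Tuple

Path = Tuple[int, ...]


def build_draft_tree(
    cond_logprobs: Callable[[Path], Dict[int, float]],
    budget: int,
    top_k: int = 2,
) -> List[Tuple[Path, float]]:
    """Pick `budget` tree nodes in order of highest cumulative log-probability.

    cond_logprobs(path) is an assumed drafter interface returning
    {token: log p(token | path)}, so each branch is scored conditionally
    on its own ancestors.
    """
    heap: List[Tuple[float, Path]] = []  # min-heap on negated score

    def push_children(path: Path, score: float) -> None:
        ranked = sorted(cond_logprobs(path).items(), key=lambda kv: kv[1], reverse=True)
        for tok, lp in ranked[:top_k]:
            heapq.heappush(heap, (-(score + lp), path + (tok,)))

    push_children((), 0.0)
    tree: List[Tuple[Path, float]] = []
    while heap and len(tree) < budget:
        neg_score, path = heapq.heappop(heap)
        tree.append((path, -neg_score))
        push_children(path, -neg_score)
    return tree


def toy_drafter(path: Path) -> Dict[int, float]:
    """Hypothetical branch-dependent distribution used only for this example."""
    if not path:
        return {1: math.log(0.7), 2: math.log(0.3)}
    if path[-1] == 1:
        return {3: math.log(0.9), 4: math.log(0.1)}
    return {5: math.log(0.6), 6: math.log(0.4)}


tree = build_draft_tree(toy_drafter, budget=4)
paths = [p for p, _ in tree]
```

With a budget of four nodes, the toy drafter spends three of them on the most likely chain and one on a sibling branch at the root. Because every score is a sum of non-positive log-probabilities, nodes come out in non-increasing score order and a node's parent is always selected before the node itself.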

Key Points

JetFlow achieves up to 9.64x speedup on MATH-500 and 4.58x on open-ended conversational benchmarks on H100 GPUs.
Combines one-forward-pass drafting efficiency with branch-wise causal conditioning, avoiding the causality-efficiency dilemma of prior head-based methods.
Outperforms bidirectional-head and tree-based speculative decoding baselines consistently on dense and MoE Qwen3 models.

Why It Matters

With reported speedups of up to 9.64x on math and 4.58x on conversational workloads, JetFlow could substantially lower LLM serving costs and latency at scale.
