Research & Papers

Orthrus: 32 parallel tokens with zero accuracy loss and 7.8x speedup

Frozen backbone LLM gets a diffusion head to generate 32 tokens at once.

Deep Dive

Orthrus, described in a new paper, targets a core bottleneck in large language model inference: autoregressive decoding emits only one token per forward pass. The team inserted a lightweight diffusion attention module into every layer of a frozen Qwen3-8B model, creating a dual-view architecture. The diffusion head proposes K=32 tokens in parallel with a single denoising step, and the original autoregressive (AR) head then verifies the proposal and accepts the longest matching prefix. Both heads share a single KV cache, so the extra memory is only ~4.5 MiB, a constant O(1) cost. The output distribution is provably identical to that of the frozen base model, so accuracy does not suffer. Only 16% of the total parameters (the diffusion modules) are trained, on fewer than 1 billion tokens, and the full training run completes in 24 hours on 8 H200 GPUs.
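
The accept-the-longest-matching-prefix step can be illustrated with a minimal sketch under greedy decoding. This is not the paper's code: the function name `accept_longest_prefix` and both argument names are hypothetical, and falling back to the autoregressive (AR) head's own token at the first mismatch is an assumption borrowed from standard speculative decoding rather than something the summary states.

```python
from typing import List, Sequence


def accept_longest_prefix(draft: Sequence[int], ar_choice: Sequence[int]) -> List[int]:
    """Commit the drafted tokens that the AR head agrees with (greedy case).

    draft[i]     -- token proposed by the parallel head at position i
    ar_choice[i] -- token the AR head would pick greedily at position i,
                    given the context plus draft[:i]
    Both names are illustrative, not taken from the paper.
    """
    if len(draft) != len(ar_choice):
        raise ValueError("expected one AR choice per draft position")

    committed: List[int] = []
    for proposed, verified in zip(draft, ar_choice):
        if proposed != verified:
            # Assumption: at the first disagreement, keep the AR head's token
            # so the committed text equals what plain greedy decoding emits.
            committed.append(verified)
            break
        committed.append(proposed)
    return committed


if __name__ == "__main__":
    # Positions 0 and 1 match; position 2 is replaced by the AR choice.
    print(accept_longest_prefix([5, 9, 2, 7], [5, 9, 4, 7]))
```

In this sketch every call commits at least one token whenever the draft is non-empty, so a poor draft costs no more committed tokens than ordinary one-token-at-a-time decoding.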

Against competing methods, Orthrus compares favorably. Other diffusion-based LMs (Dream, Fast-dLLM-v2, SDAR, Mercury, and Gemini Diffusion) modify the base weights and lose accuracy; Fast-dLLM-v2, for example, drops 11 points on MATH-500. Orthrus freezes the backbone and matches Qwen3-8B exactly. Compared with speculative decoding techniques such as EAGLE-3 and DFlash, Orthrus needs no external drafter model and no separate KV cache, and it adds no time-to-first-token (TTFT) penalty. Acceptance length on MATH-500 reaches 11.7 tokens, versus 7.9 for DFlash and 3.5 for EAGLE-3. Single-step denoising (6.35 tokens per footprint) outperforms multi-step denoising (3.53). Training with KL distillation rather than cross-entropy also yields higher acceptance rates. Limitations include heavy dependence on the frozen base model (Orthrus inherits its biases and hallucinations), evaluation only on Qwen3, and support for greedy and rejection sampling only.
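
To put the reported figures side by side, the short sketch below only divides numbers quoted in this summary; it measures nothing new. The constant and function names are hypothetical, and because acceptance length is an average, the "fraction of the K=32 block accepted" is a rough indication rather than a per-step guarantee.

```python
# Figures quoted in the summary above (MATH-500).
ACCEPTANCE_LENGTH = {"Orthrus": 11.7, "DFlash": 7.9, "EAGLE-3": 3.5}
TOKENS_PER_FOOTPRINT = {"single-step": 6.35, "multi-step": 3.53}
DRAFT_BLOCK = 32  # K, tokens proposed per parallel draft


def summarize() -> dict:
    """Return simple ratios between the reported numbers."""
    orthrus = ACCEPTANCE_LENGTH["Orthrus"]
    return {
        "acceptance_vs_dflash": orthrus / ACCEPTANCE_LENGTH["DFlash"],
        "acceptance_vs_eagle3": orthrus / ACCEPTANCE_LENGTH["EAGLE-3"],
        "single_vs_multi_step": (
            TOKENS_PER_FOOTPRINT["single-step"] / TOKENS_PER_FOOTPRINT["multi-step"]
        ),
        # Average share of each 32-token draft that survives verification.
        "avg_block_fraction_accepted": orthrus / DRAFT_BLOCK,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:.2f}")
```

On these numbers the average accepted run is roughly 1.5x that of DFlash and 3.3x that of EAGLE-3, and a little over a third of each drafted block is accepted on average.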

Key Points
  • Orthrus drafts 32 tokens in parallel with a single-step diffusion head, and the AR head verifies them, so the output distribution matches the base model exactly.
  • Achieves 7.8x tokens-per-footprint and ~6x wall-clock speedup on MATH-500, with only ~4.5 MiB additional KV cache overhead (O(1)).
  • Training requires only 16% of parameters, <1B tokens, and 24 hours on 8 H200 GPUs; acceptance length is 11.7 tokens vs 3.5 for EAGLE-3.

Why It Matters

Enables lossless, roughly 6x faster LLM inference without an external drafter model, which suits latency-sensitive deployments.