Orthrus-Qwen3-8B generates up to 7.8x more tokens per forward pass with a frozen backbone
A new method preserves Qwen3-8B accuracy exactly while delivering about 6x faster wall-clock generation
Orthrus-Qwen3-8B is a new inference-acceleration technique that injects a trainable diffusion attention module into each layer of a frozen autoregressive (AR) Transformer, Qwen3-8B. The diffusion and AR heads share a single KV cache: the diffusion head proposes 32 tokens in parallel, and the AR head verifies them in a second pass, accepting the longest matching prefix. The output distribution is provably identical to the base model's. Only 16% of the parameters are trainable, and training used fewer than 1B tokens over 24 hours on 8xH200. The added KV-cache overhead is constant at ~4.5 MiB.
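The draft-then-verify step described above can be sketched in a few lines. This is a minimal illustration under greedy decoding, not the authors' implementation: `verify_block`, `longest_matching_prefix`, and the toy `counting_model` are hypothetical names, the per-position loop stands in for what would be a single batched AR forward pass, and appending the AR head's own token after the accepted prefix is an assumption borrowed from common speculative-decoding practice.

```python
from typing import Callable, List, Sequence

Token = int
# Hypothetical stand-in for the AR head: maps a token prefix to its greedy next token.
NextToken = Callable[[Sequence[Token]], Token]


def longest_matching_prefix(draft: Sequence[Token], verified: Sequence[Token]) -> int:
    """Count how many leading draft tokens agree with the verifier's tokens."""
    n = 0
    for d, v in zip(draft, verified):
        if d != v:
            break
        n += 1
    return n


def verify_block(context: List[Token], draft: Sequence[Token], ar_next: NextToken) -> List[Token]:
    """Accept the longest draft prefix the AR head agrees with, then add one AR token.

    verified[i] is the AR head's choice given the context plus the first i draft
    tokens. A real model would compute all of these in one forward pass.
    """
    verified = [ar_next(list(context) + list(draft[:i])) for i in range(len(draft) + 1)]
    n = longest_matching_prefix(draft, verified)
    # Assumption: the AR token at the first mismatch is kept, so every step
    # emits at least one token even when the whole draft is rejected.
    return list(draft[:n]) + [verified[n]]


def counting_model(prefix: Sequence[Token]) -> Token:
    """Toy AR head that always predicts the previous token plus one."""
    return prefix[-1] + 1


# The draft gets the first two tokens right and then diverges.
accepted = verify_block([1, 2, 3], [4, 5, 9, 10], counting_model)
print(accepted)  # [4, 5, 6]
```

Because every emitted token is one the AR head would have chosen itself, the output under greedy decoding matches plain autoregressive decoding; the draft only changes how many tokens are confirmed per verification pass.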
On MATH-500, Orthrus achieves up to 7.8x tokens per forward pass (TPF) and ~6x wall-clock speedup. Its acceptance length is 11.7, outperforming DFlash (7.9) and EAGLE-3 (3.5). Unlike diffusion LMs that modify the base weights and lose accuracy (e.g., Fast-dLLM-v2 drops 11 points on MATH-500), Orthrus freezes the backbone and preserves accuracy exactly. Single-step denoising (6.35 TPF) beats multi-step denoising (3.53 TPF), and KL distillation outperforms cross-entropy (CE) training on acceptance rate. Limitations: output quality is strictly bounded by the frozen base model, and the method has been evaluated only on Qwen3.
- Up to 7.8x tokens per forward pass and ~6x wall-clock speedup on MATH-500, with accuracy exactly matching Qwen3-8B.
- No external drafter model, zero TTFT penalty, and constant KV overhead (~4.5 MiB) thanks to the shared cache.
- Single-step denoising and KL distillation achieve acceptance length 11.7, outperforming DFlash (7.9) and EAGLE-3 (3.5).
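As a rough sanity check on the ~4.5 MiB figure, the arithmetic below shows that it matches the KV footprint of one 32-token block. This is one plausible reading, not a derivation given in the source: the layer count (36), number of KV heads (8), and head dimension (128) are assumed from the public Qwen3-8B configuration, 16-bit cache storage is assumed, and `kv_bytes_per_token` is a hypothetical helper.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, bytes_per_value: int) -> int:
    """Bytes of KV cache for one token: a key and a value vector per KV head per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Qwen3-8B shape: 36 layers, 8 KV heads (GQA), head_dim 128, 2-byte values.
per_token = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128, bytes_per_value=2)

block_tokens = 32  # tokens proposed in parallel by the diffusion head
block_mib = block_tokens * per_token / 2**20

print(per_token)  # 147456 bytes, i.e. 144 KiB per token
print(block_mib)  # 4.5
```

Under these assumptions the overhead does not grow with context length, since it depends only on the block size, which is consistent with the "constant" description.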
Why It Matters
Orthrus speeds up inference for an existing LLM without any accuracy loss or changes to the base weights, which could make real-time AI applications more practical at minimal memory overhead.