
Open-dLLM hits 3,238 tok/s on RTX 5090 with Qwen3.6 diffusion model

r/LocalLLaMA May 16, 2026

⚡ A diffusion LLM running on a single consumer GPU exceeds 3,000 tokens per second.

Deep Dive

The Open-dLLM project, originally created by Fred Zhangzhi Peng, Shuibai Zhang, and Alex Tong, converts autoregressive language models into diffusion-based generative models. A developer recently forked the repository, whose codebase was more than six months old, and used AI tools to update it so that it supports Qwen3.6 and incorporates the recent LDLM (Latent Diffusion Language Model) paper by Viacheslav Meshchaninov et al. The resulting pipeline runs diffusion steps on top of a frozen LLM backbone and uses a Perceiver decoder to map the latents back to tokens. The encoder is needed only during training, where it produces the latent targets; at inference it is deleted entirely, leaving only the diffusion head and the decoder.
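The control flow described above can be illustrated with a minimal, pure-Python sketch. It is not Open-dLLM's code: the class `LatentDiffusionPipeline`, its methods, the toy components, and all shapes are hypothetical stand-ins, and the frozen backbone is folded into the toy diffusion head. The only point it tries to show is that generation starts from noise, refines all latents for a fixed number of steps, decodes once, and never calls the encoder.

```python
import random
from typing import Callable, List, Optional

Latents = List[List[float]]


class LatentDiffusionPipeline:
    """Hypothetical container for the three components named in the article."""

    def __init__(self, encoder: Optional[Callable], diffusion_head: Callable,
                 decoder: Callable) -> None:
        self.encoder = encoder                # training only: tokens -> latent targets
        self.diffusion_head = diffusion_head  # refines noisy latents, one call per step
        self.decoder = decoder                # maps latents back to token ids

    def drop_encoder(self) -> None:
        """Release the encoder before inference; generate() never uses it."""
        self.encoder = None

    def generate(self, seq_len: int, num_latents: int, latent_dim: int,
                 num_steps: int, seed: int = 0) -> List[int]:
        rng = random.Random(seed)
        # Start from pure Gaussian noise (shapes are illustrative assumptions).
        latents: Latents = [[rng.gauss(0.0, 1.0) for _ in range(latent_dim)]
                            for _ in range(num_latents)]
        # Every step updates all latents at once, so cost grows with num_steps.
        for step in range(num_steps):
            latents = self.diffusion_head(latents, step, num_steps)
        # A single decoder pass turns the final latents into tokens.
        return self.decoder(latents, seq_len)


def toy_head(latents: Latents, step: int, num_steps: int) -> Latents:
    # Stand-in denoiser: shrinks values toward zero. A real head is a trained network.
    return [[0.5 * x for x in row] for row in latents]


def toy_decoder(latents: Latents, seq_len: int) -> List[int]:
    # Stand-in decoder: emits token id 0 or 1 from the sign of a latent's sum.
    return [int(sum(latents[i % len(latents)]) > 0) for i in range(seq_len)]


pipe = LatentDiffusionPipeline(encoder=lambda ids: ids,
                               diffusion_head=toy_head, decoder=toy_decoder)
pipe.drop_encoder()
tokens = pipe.generate(seq_len=64, num_latents=16, latent_dim=8, num_steps=10)
print(len(tokens), pipe.encoder is None)
```

In this sketch, lowering `num_steps` from 10 to 4 removes six calls to the head but leaves the noise sampling and the decoder pass unchanged.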

On a single RTX 5090 with 32GB of VRAM, Qwen3.6-35B-A3B, a mixture-of-experts model with only 3B active parameters per token, achieves 3,238 tok/s with 10 diffusion steps and an estimated 6,500 tok/s with 4 steps. The dense 27B variant (6.75B trainable parameters) reaches 745 tok/s at 10 steps and an estimated 1,500 tok/s at 4 steps. These figures are for a short 64‑token sequence, batch size 1, and untrained weights. The authors note that longer sequences reduce throughput proportionally, while batch scaling is near‑linear for the MoE model because of its smaller hidden dimension. Although the weights are untrained, inference speed does not depend on the weight values, so the measured speed carries over to a trained model. Quality benchmarks (perplexity, HumanEval) are pending until training completes.
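The 4-step estimates are lower than a pure 10/4 scaling of the 10-step numbers would give (3,238 × 2.5 ≈ 8,095 tok/s and 745 × 2.5 ≈ 1,863 tok/s), which suggests some cost that does not shrink with the step count. The sketch below makes that explicit under two assumptions that are not stated in the article: that throughput equals the 64-token sequence length divided by wall-clock latency, and that latency is a fixed overhead plus a constant cost per diffusion step. The function names are illustrative.

```python
def latency_ms(tokens: int, tok_per_s: float) -> float:
    """Milliseconds needed to emit `tokens` at the given throughput."""
    return 1000.0 * tokens / tok_per_s


def fit_linear_cost(tokens: int, steps_a: int, tps_a: float,
                    steps_b: int, tps_b: float) -> tuple:
    """Fit latency = fixed + per_step * steps through two reported points."""
    t_a = latency_ms(tokens, tps_a)
    t_b = latency_ms(tokens, tps_b)
    per_step = (t_a - t_b) / (steps_a - steps_b)
    fixed = t_a - per_step * steps_a
    return fixed, per_step


SEQ_LEN = 64  # sequence length reported in the article
REPORTED = {
    # model: ((steps, tok/s), (steps, estimated tok/s))
    "Qwen3.6-35B-A3B (MoE)": ((10, 3238.0), (4, 6500.0)),
    "Qwen3.6-27B (dense)": ((10, 745.0), (4, 1500.0)),
}

for name, ((s_a, tps_a), (s_b, tps_b)) in REPORTED.items():
    fixed, per_step = fit_linear_cost(SEQ_LEN, s_a, tps_a, s_b, tps_b)
    print(f"{name}: fixed ~{fixed:.1f} ms, per step ~{per_step:.2f} ms")
```

Because the 4-step values are themselves estimates, the fitted numbers are only a rough reading of the reported figures, not measurements.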

Key Points

Qwen3.6-35B-A3B MoE reaches 3,238 tok/s at 10 diffusion steps, ~6,500 tok/s at 4 steps on RTX 5090 (32GB VRAM)
Qwen3.6-27B dense model achieves 745 tok/s at 10 steps, ~1,500 tok/s at 4 steps
Encoder is deleted during inference; only diffusion head and Perceiver decoder remain, enabling fast single‑GPU generation

Why It Matters

Diffusion LLMs on consumer hardware could lower the barrier for large‑scale language model inference.
