Open Source

NVIDIA's Nemotron-TwoTower is a diffusion LM with 2.42x throughput

r/LocalLLaMA June 26, 2026

⚡ A new diffusion-based LM that generates tokens in parallel, not one by one

Deep Dive

NVIDIA has introduced Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that departs from token-by-token autoregressive generation. Built on the Nemotron 3 Nano 30B-A3B backbone, it uses a two-tower architecture: a frozen autoregressive context tower and a diffusion denoiser tower. The denoiser iteratively fills in blocks of tokens in parallel, which speeds up generation at a small cost in quality.
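To make the idea of filling a block in parallel more concrete, here is a toy sketch of masked block decoding. It is not NVIDIA's implementation: the `MASK` sentinel, the `fill_block` and `toy_denoise` names, the confidence-based unmasking order, and the fixed number of steps are all assumptions chosen for illustration. The only thing taken from the description above is the overall shape, in which a context is held fixed while a denoiser repeatedly proposes tokens for every masked position in a block.

```python
from typing import Callable, List, Tuple

MASK = -1  # hypothetical sentinel marking a position that is still masked

Prediction = Tuple[int, float]  # (proposed token id, confidence)
Denoiser = Callable[[List[int], List[int]], List[Prediction]]


def fill_block(context: List[int], block_size: int,
               denoise: Denoiser, steps: int) -> List[int]:
    """Fill one block of tokens in a few denoising steps.

    Each step asks the denoiser for a proposal at every position, then
    commits the most confident proposals among the still-masked slots.
    This unmasking schedule is an assumption, not the model's actual one.
    """
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    while MASK in block:
        preds = denoise(context, block)  # one call covers the whole block
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:per_step]:
            block[i] = preds[i][0]
    return block


def toy_denoise(context: List[int], block: List[int]) -> List[Prediction]:
    """Stand-in for a real denoiser: deterministic and meaningless.

    Proposes (sum of context + position) as the token and is more
    confident about earlier positions.
    """
    base = sum(context)
    return [(base + i, 1.0 / (i + 1)) for i in range(len(block))]


if __name__ == "__main__":
    # Four tokens are produced with two denoiser calls instead of four
    # sequential next-token calls.
    print(fill_block([1, 2, 3], block_size=4, denoise=toy_denoise, steps=2))
```

In this sketch the potential saving comes from the number of denoiser calls (`steps`) being smaller than the number of tokens in the block; how the real model schedules its steps is not stated in the summary.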

According to NVIDIA, the default mask-diffusion setup achieves 98.7% of the autoregressive baseline's aggregate benchmark quality while delivering 2.42x the wall-clock generation throughput. That is a meaningful efficiency gain for large-scale inference and could reduce latency and compute costs for applications such as chatbots, code generation, and content creation.
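As a rough reading of those two figures, the arithmetic below converts the reported throughput ratio into relative generation time. It assumes the 2.42x ratio holds uniformly for a given workload, which the summary does not claim, and the 10-second baseline is a made-up example rather than a measured number.

```python
def relative_time(speedup: float) -> float:
    """Fraction of baseline wall-clock time needed at a given throughput ratio."""
    return 1.0 / speedup


def quality_gap_percent(retained_percent: float) -> float:
    """Relative quality given up versus the baseline, in percent."""
    return 100.0 - retained_percent


if __name__ == "__main__":
    frac = relative_time(2.42)      # roughly 0.413 of baseline time
    baseline_seconds = 10.0         # hypothetical baseline generation time
    print(f"time fraction: {frac:.3f}")
    print(f"time saved: {(1 - frac) * 100:.1f}%")
    print(f"example: {baseline_seconds:.1f}s becomes {baseline_seconds * frac:.2f}s")
    print(f"quality gap: {quality_gap_percent(98.7):.1f}%")
```

Under that assumption, the same output would take about 41% of the baseline time, in exchange for a 1.3% relative drop in aggregate benchmark quality.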

Key Points

Nemotron-TwoTower uses a diffusion denoiser to generate token blocks in parallel
Achieves 98.7% of autoregressive baseline quality with 2.42x throughput
Built on the Nemotron 3 Nano 30B-A3B backbone from NVIDIA

Why It Matters

Faster generation with minimal quality loss could unlock new real-time applications and reduce inference costs.
