Open Source

NVIDIA's Nemotron-TwoTower is a diffusion LM with 2.42x the throughput of its autoregressive baseline

A new diffusion-based LM that generates tokens in parallel, not one by one

Deep Dive

NVIDIA has introduced Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that departs from autoregressive, token-by-token generation. Built on the Nemotron 3 Nano 30B-A3B backbone, the model uses a two-tower architecture: a frozen autoregressive context tower and a diffusion denoiser tower. The denoiser iteratively fills blocks of tokens in parallel, which speeds up generation at a small cost in quality.
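To make the idea of filling a block in parallel concrete, here is a minimal toy sketch of mask-style block decoding. It is not NVIDIA's implementation: the `toy_denoiser` function, the confidence-based commit rule, the block size, and the step count are all assumptions chosen for illustration, and a random guesser stands in for the real denoiser tower.

```python
import random

MASK = None  # placeholder for a position that has not been generated yet


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for a denoiser tower.

    Returns a (token, confidence) guess for every masked position at once.
    A real model would condition on `context`; this toy version ignores it.
    """
    vocab = ["the", "model", "fills", "tokens", "in", "parallel"]
    return {
        i: (rng.choice(vocab), rng.random())
        for i, tok in enumerate(block)
        if tok is MASK
    }


def generate_block(context, block_size=8, steps=4, seed=0):
    """Fill one block by repeatedly committing the most confident guesses."""
    rng = random.Random(seed)
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    steps_used = 0
    while any(tok is MASK for tok in block):
        guesses = toy_denoiser(context, block, rng)
        # Commit only the highest-confidence positions in this pass.
        best = sorted(guesses.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _) in best[:per_step]:
            block[i] = tok
        steps_used += 1
    return block, steps_used


block, steps_used = generate_block(["Hello"], block_size=8, steps=4)
print(block)
print(f"{steps_used} denoiser passes for {len(block)} tokens")
```

In this sketch, eight tokens are produced in four denoiser passes rather than eight sequential ones, which is the general source of the speedup in parallel block decoding. The actual sampling schedule used by Nemotron-TwoTower is not described here.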

According to NVIDIA, the default mask-diffusion setup achieves 98.7% of the autoregressive baseline's aggregate benchmark quality while reaching 2.42 times its wall-clock generation throughput. At that rate, generating the same number of tokens takes roughly 41% of the baseline's wall-clock time, which could reduce latency and compute costs for applications such as chatbots, code generation, and content creation.
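The reported figures can be turned into a rough time-and-quality trade-off with a few lines of arithmetic. The sketch below assumes throughput means tokens per second on the same hardware, so wall-clock time for a fixed number of tokens scales as the inverse of the speedup; the variable names are illustrative only.

```python
def relative_time_per_token(speedup):
    """Fraction of baseline wall-clock time needed at a given throughput multiple."""
    return 1.0 / speedup


speedup = 2.42            # reported throughput relative to the autoregressive baseline
quality_retained = 0.987  # reported share of aggregate benchmark quality

time_fraction = relative_time_per_token(speedup)  # about 0.413
time_saved = 1.0 - time_fraction                  # about 0.587
quality_gap = 1.0 - quality_retained              # about 0.013

print(f"Time per token vs. baseline: {time_fraction:.1%}")
print(f"Wall-clock time saved:       {time_saved:.1%}")
print(f"Aggregate quality gap:       {quality_gap:.1%}")
```

Any cost saving follows only to the extent that inference cost is proportional to wall-clock time, which is an assumption rather than something NVIDIA's figures state.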

Key Points
  • Nemotron-TwoTower uses a diffusion denoiser to generate token blocks in parallel
  • Achieves 98.7% of the autoregressive baseline's aggregate benchmark quality at 2.42x its throughput
  • Built on the Nemotron 3 Nano 30B-A3B backbone from NVIDIA

Why It Matters

Faster generation with minimal quality loss could unlock new real-time applications and reduce inference costs.
