NVIDIA's Nemotron-TwoTower is a diffusion LM with 2.42x the throughput of its autoregressive baseline
A new diffusion-based LM that generates tokens in parallel, not one by one
NVIDIA has introduced Nemotron-TwoTower-30B-A3B-Base-BF16, a diffusion-based language model that departs from the usual autoregressive, token-by-token generation. Built on the Nemotron 3 Nano 30B-A3B backbone, the model uses a two-tower architecture: a frozen autoregressive context tower and a diffusion denoiser tower. The denoiser iteratively fills blocks of tokens in parallel, which speeds up generation at a small cost in benchmark quality.
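To make block-parallel filling concrete, here is a minimal toy sketch in Python. It is not NVIDIA's implementation: `toy_denoiser`, the confidence-based unmasking schedule, and the parameters `block_size` and `steps` are all assumptions chosen for illustration, and the "context tower" is reduced to a plain list of earlier tokens. The sketch only shows the general shape of mask-diffusion decoding, where each denoiser pass commits several positions of a block at once.

```python
import random

MASK = None  # placeholder for a position that has not been generated yet


def toy_denoiser(context, block, rng):
    """Hypothetical stand-in for the denoiser tower.

    Proposes a (token, confidence) pair for every masked position in one pass.
    A real denoiser would condition on the context tower's representations.
    """
    proposals = {}
    for i, tok in enumerate(block):
        if tok is MASK:
            position = len(context) + i
            proposals[i] = (f"tok{position}", rng.random())
    return proposals


def generate_block(context, block_size, steps, rng):
    """Fill one block by iterative unmasking, several positions per pass."""
    block = [MASK] * block_size
    per_step = -(-block_size // steps)  # ceiling division
    passes = 0
    while MASK in block:
        proposals = toy_denoiser(context, block, rng)
        # Commit the most confident proposals; the rest stay masked for now.
        best = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in best[:per_step]:
            block[i] = proposals[i][0]
        passes += 1
    return block, passes


def generate(prompt, num_blocks, block_size, steps, seed=0):
    """Generate blocks left to right; each finished block joins the context."""
    rng = random.Random(seed)
    context = list(prompt)
    total_passes = 0
    for _ in range(num_blocks):
        block, passes = generate_block(context, block_size, steps, rng)
        context.extend(block)
        total_passes += passes
    return context, total_passes


tokens, passes = generate(["<s>"], num_blocks=2, block_size=8, steps=4)
print(len(tokens) - 1, "tokens in", passes, "denoiser passes")
# prints: 16 tokens in 8 denoiser passes
```

In this toy, 16 tokens take 8 denoiser passes, where a token-by-token loop would take 16 forward passes. Pass counts alone do not determine wall-clock speed, since a denoiser pass over a block may cost more than one autoregressive step.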
According to NVIDIA, the default mask-diffusion setup achieves 98.7% of the autoregressive baseline's aggregate benchmark quality while reaching 2.42 times the wall-clock generation throughput. If those figures carry over to deployment, the gain could reduce latency and compute costs for large-scale inference in applications such as chatbots, code generation, and content creation.
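As a rough worked example of what a 2.42x throughput ratio means for generation time, the following Python snippet converts it into relative time for a fixed number of output tokens. It assumes the same hardware and that the ratio applies uniformly across the workload, which the reported figure alone does not guarantee; the function name is illustrative.

```python
def relative_generation_time(throughput_ratio):
    """Time to generate a fixed number of tokens, relative to the baseline."""
    return 1.0 / throughput_ratio


ratio = 2.42  # throughput ratio reported by NVIDIA
rel = relative_generation_time(ratio)
print(f"{rel:.1%} of baseline time, {1 - rel:.1%} less")
# prints: 41.3% of baseline time, 58.7% less
```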
- Nemotron-TwoTower uses a diffusion denoiser to generate token blocks in parallel
- Reaches 98.7% of the autoregressive baseline's aggregate benchmark quality at 2.42x the wall-clock throughput, according to NVIDIA
- Built on the Nemotron 3 Nano 30B-A3B backbone from NVIDIA
Why It Matters
Faster generation with minimal quality loss could unlock new real-time applications and reduce inference costs.