National University of Singapore Presents "DMax": A New Paradigm For Diffusion Language Models (dLLMs) Enabling Aggressive Parallel Decoding.
New training method lets diffusion language models revise their own errors, enabling aggressive parallel generation.
Researchers from the National University of Singapore have introduced DMax, a novel framework designed to solve a critical bottleneck in diffusion language models (dLLMs). While dLLMs promise faster text generation by predicting multiple tokens in parallel, their speed is severely limited by error accumulation: early incorrect guesses corrupt subsequent predictions, forcing a trade-off between speed and quality. DMax overcomes this by fundamentally rethinking decoding as a process of progressive self-refinement, in which the model learns to iteratively correct its own erroneous outputs before finalizing them.
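To make the refinement idea concrete, the sketch below shows what such a decoding loop can look like. Everything here is illustrative rather than taken from the paper: the `ToyDenoiser` stand-in model, the mask id, the fixed step count, and the confidence threshold `tau` are all assumptions. The structural point is that every step re-predicts all positions in parallel, so a position filled in early can still be revised later instead of corrupting everything downstream.

```python
import torch

# Toy stand-in for a dLLM denoiser: maps a (batch, seq_len) token grid to
# per-position logits in one forward pass. Hypothetical, for illustration only.
class ToyDenoiser(torch.nn.Module):
    def __init__(self, vocab_size=100, dim=32):
        super().__init__()
        self.embed = torch.nn.Embedding(vocab_size, dim)
        self.proj = torch.nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        return self.proj(self.embed(tokens))  # (batch, seq_len, vocab)

MASK_ID = 0  # assumed id of the mask token

@torch.no_grad()
def self_refining_decode(model, seq_len=16, steps=8, tau=0.9):
    """Each step re-predicts every position in parallel; no position is ever
    frozen, so an early wrong guess can be overwritten on a later step."""
    tokens = torch.full((1, seq_len), MASK_ID)      # start fully masked
    for _ in range(steps):
        probs = model(tokens).softmax(dim=-1)       # one parallel forward pass
        conf, guess = probs.max(dim=-1)             # per-position best token
        # Keep confident guesses, re-mask the rest for another refinement round.
        tokens = torch.where(conf >= tau, guess, torch.full_like(guess, MASK_ID))
    return tokens

print(self_refining_decode(ToyDenoiser()))
```

A conventional dLLM decoder would freeze tokens once committed; keeping every position open to revision is what lets a self-refining decoder fill in many positions per step without locking in early mistakes.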
At its core are two innovations: On-Policy Uniform Training and Soft Parallel Decoding. The training strategy exposes the model to its own imperfect predictions, teaching it to recover clean tokens from noisy inputs. During inference, Soft Parallel Decoding represents intermediate states as a blend between a predicted token embedding and a mask embedding, preserving uncertainty and enabling revision. This allows the model to decode much more aggressively. Benchmarks show dramatic speedups: on the GSM8K math dataset, throughput per forward pass (TPF) jumped from 2.04 to 5.47 while maintaining accuracy. On two NVIDIA H200 GPUs, the model achieved an impressive 1,338 tokens per second at a batch size of 1.
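The heart of Soft Parallel Decoding, as described above, is the intermediate state: rather than committing to a hard token, each undecided position is represented in embedding space as a blend of the predicted token's embedding and the mask embedding. Below is a minimal sketch of that blend; the confidence-as-weight scheme and all names are our assumptions for illustration, not the paper's exact formulation.

```python
import torch

vocab_size, dim = 100, 32
embed = torch.nn.Embedding(vocab_size, dim)  # token embedding table
MASK_ID = 0                                  # assumed id of the mask token

def soft_state(probs):
    """Represent each intermediate position as a blend between its predicted
    token embedding and the mask embedding, weighted by model confidence,
    so uncertainty is preserved instead of collapsed to a hard token.

    probs: (batch, seq_len, vocab) distribution from the last forward pass.
    """
    conf, guess = probs.max(dim=-1)            # confidence and argmax token
    tok_emb = embed(guess)                     # (batch, seq_len, dim)
    mask_emb = embed.weight[MASK_ID]           # (dim,), broadcast below
    w = conf.unsqueeze(-1)                     # confidence as blend weight
    return w * tok_emb + (1.0 - w) * mask_emb  # soft intermediate state

# Random distribution standing in for real model output.
probs = torch.randn(1, 16, vocab_size).softmax(dim=-1)
print(soft_state(probs).shape)  # torch.Size([1, 16, 32])
```

Because the blend lives in embedding space, low-confidence positions stay close to the mask embedding and remain easy to revise on the next pass, while high-confidence positions behave almost like committed tokens.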
The results demonstrate that DMax isn't just a marginal improvement but a paradigm shift that makes high-parallelism decoding practical for dLLMs. By enabling models to self-revise, it decouples decoding speed from quality degradation, resolving a longstanding trade-off. This work paves the way for diffusion-based models to become a more viable and efficient alternative to traditional autoregressive LLMs for real-time, high-throughput applications.
- Enables aggressive parallel decoding by letting models self-correct errors, boosting throughput per forward pass (TPF) by roughly 2.7x (2.04 to 5.47) on benchmarks like GSM8K.
- Pairs a novel training strategy, On-Policy Uniform Training, which teaches the model to recover from its own imperfect predictions, with Soft Parallel Decoding for iterative refinement in embedding space (see the training sketch after this list).
- Achieved 1,338 tokens per second on two H200 GPUs at batch size 1, demonstrating practical high-speed generation while preserving task accuracy.
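For the training side mentioned in the list above, the sketch below shows one plausible reading of On-Policy Uniform Training: corrupt a clean sequence at a uniformly sampled rate, let the current model fill it in, and then train the model to recover the clean tokens from its own possibly wrong guesses. The uniform corruption rate, the greedy on-policy rollout, and the loss are assumptions for illustration, not the paper's published recipe.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed id of the mask token

def on_policy_training_step(model, optimizer, clean_tokens):
    """One sketched step: corrupt the clean sequence, let the *current* model
    fill it in, then train it to recover the clean tokens from its own
    (possibly wrong) predictions."""
    # 1. Corrupt: mask each position with a uniformly sampled rate (assumed).
    rate = torch.rand(())
    drop = torch.rand_like(clean_tokens, dtype=torch.float) < rate
    masked = torch.where(drop, torch.full_like(clean_tokens, MASK_ID), clean_tokens)
    # 2. On-policy: the model's own greedy guesses become the next input, so
    #    training sees the same imperfect states that decoding will produce.
    with torch.no_grad():
        noisy_input = model(masked).argmax(dim=-1)
    # 3. Recover: cross-entropy pushes the model to map its own errors back
    #    to the clean sequence.
    logits = model(noisy_input)
    loss = F.cross_entropy(logits.flatten(0, 1), clean_tokens.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a tiny stand-in model (hypothetical, untrained).
model = torch.nn.Sequential(torch.nn.Embedding(100, 32), torch.nn.Linear(32, 100))
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
clean = torch.randint(1, 100, (2, 16))
print(on_policy_training_step(model, opt, clean))
```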
Why It Matters
Unlocks significantly faster text generation for AI applications without sacrificing quality, making real-time, high-volume AI interactions more feasible.