Research & Papers

Masked Diffusion LLMs beat autoregressive models up to 4x their size as world models for agentic RL

New any-order denoising LLMs produce globally coherent rollouts and boost task success by up to 15 percentage points.

Deep Dive

Autoregressive LLM world models suffer from a fundamental limitation: they generate next-state predictions left-to-right, so they cannot condition on globally interdependent anchors such as tool schemas, trailing status fields, or expected outcomes. The resulting rollouts are consistent with their prefix but globally incoherent, a failure mode known as prefix mode collapse. Masked Diffusion Language Models (MDLMs) sidestep this via an any-order denoising objective that learns every conditional direction from the same training signal, enabling coherent generation that respects global constraints.
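To make "any-order denoising" concrete, here is a minimal sketch of how a masked-diffusion training example is commonly built: a masking ratio is drawn at random, tokens are masked independently at that ratio, and the model is asked to recover the masked positions from everything still visible, on either side. This follows a common formulation of masked diffusion training and is not taken from the paper; the `corrupt` helper, the `<mask>` token, and the example state string are all hypothetical.

```python
import random

MASK = "<mask>"  # hypothetical mask token

def corrupt(tokens, rng):
    """Mask a random subset of tokens for one denoising training step.

    Returns the corrupted sequence, a dict {position: original token}
    that the model would be trained to recover, and the masking ratio t.
    """
    t = rng.uniform(0.1, 1.0)  # masking ratio drawn per example (assumed range)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < t:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets, t

# Hypothetical next-state prediction with a trailing status field.
state = ["action:", "open", "door", "|", "obs:", "door", "is", "open",
         "|", "status:", "ok"]

rng = random.Random(0)
for _ in range(3):
    corrupted, targets, t = corrupt(state, rng)
    print(f"t={t:.2f}", " ".join(corrupted))
```

Because the mask pattern is random, some training examples leave the trailing status visible while earlier tokens are hidden, so the same objective covers predicting earlier content from later anchors as well as the usual left-to-right direction.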

Empirically, fine-tuned MDLMs (SDAR-8B and WeDLM-8B) outperform autoregressive baselines up to 4x their total parameter count across BLEU-1, ROUGE-L, and MAUVE on both in- and out-of-domain splits. Lower Self-BLEU and higher Distinct-N indicate reduced repetition and increased diversity. When used for GRPO (group relative policy optimization) training, MDLM-generated rollouts deliver absolute task-success gains of up to 15 percentage points over AR-generated data. These gains are measured on held-out benchmarks (ScienceWorld, ALFWorld, AppWorld) across 1.2B–7B backbone models (LFM2.5, Qwen3, Mistral) in a zero-shot transfer setting.
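For readers unfamiliar with GRPO, its core step is to score each rollout relative to the other rollouts sampled for the same task, rather than against a learned value function. The sketch below shows that group-relative normalization only. The function name, the reward values, the use of population standard deviation, and the `eps` stabilizer are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same task.

    Each advantage is (reward - group mean) / (group std + eps).
    Population std and eps are assumptions made for this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical task-success rewards for four rollouts of one task.
rewards = [1.0, 0.0, 0.0, 1.0]
print(group_relative_advantages(rewards))
```

Rollouts that beat their group average get positive advantages and the rest get negative ones, which is why the quality and diversity of the rollouts within each group matter for the training signal.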

Key Points
  • MDLMs' any-order denoising avoids prefix mode collapse, producing globally coherent world model rollouts.
  • Fine-tuned MDLMs (SDAR-8B, WeDLM-8B) beat autoregressive models up to 4x their parameter count on BLEU-1, ROUGE-L, and MAUVE.
  • GRPO with MDLM rollouts yields absolute task-success gains of up to 15 percentage points on ScienceWorld, ALFWorld, and AppWorld across 1.2B–7B backbones.

Why It Matters

More coherent and diverse world-model rollouts give agent RL better training data, which translates into higher task success on held-out agent benchmarks, with potential relevance to robotics and simulation.