TraFL Trajectory-Balance Post-Training Boosts Diffusion Language Models
New method fixes 'trajectory locking' in diffusion LMs, beating all baselines on math and code.
Current post-training methods for diffusion language models rely on reward-maximizing objectives that cause a failure mode called 'trajectory locking': reward-driven updates computed from sampled trajectories concentrate probability mass onto a narrow set of denoising paths, reducing coverage of alternative correct solutions under repeated sampling. This limits the model's ability to generate diverse and robust outputs, and the cost grows with the sampling budget.
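To make "coverage under repeated sampling" concrete, here is a minimal sketch that assumes coverage is measured with the widely used unbiased pass@k estimator. The summary does not name the metric, and the counts below are invented toy numbers, not results from the paper.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k: chance that at least one of k samples, drawn without
    replacement from n generations of which c are correct, is correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over prompts, given the correct count for each prompt."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts of correct samples out of n = 10, for two prompts.
locked = [10, 0]   # always solves one prompt, never the other
diverse = [4, 4]   # solves both prompts some of the time

for k in (1, 5):
    print(k, mean_pass_at_k(locked, 10, k), mean_pass_at_k(diverse, 10, k))
```

In this toy setup the "locked" model is ahead at k = 1 but gains nothing from extra samples, while the "diverse" model overtakes it at k = 5. That is the pattern the trajectory-locking description predicts as the sampling budget grows.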
To address this, the team behind TraFL introduces a trajectory-balance objective that trains the policy toward a reward-tilted target distribution anchored to a frozen reference model. They make this practical for diffusion LMs with a sequence-level surrogate and a learned prompt-dependent normalization. On mathematical reasoning (Minerva Math) and code generation (LiveCodeBench) benchmarks, TraFL is the only evaluated method that improves over the base model across all benchmark-length settings, with gains that persist as the sampling budget increases. These improvements also transfer to held-out evaluations, and TraFL is the strongest method on every LiveCodeBench difficulty split.
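The objective described above can be sketched in a few lines. This is a minimal illustration, assuming the standard trajectory-balance form for a target proportional to p_ref(y|x) * exp(r(x, y) / beta); the summary does not give TraFL's exact loss. The names `Sample`, `tb_loss`, `beta`, and `log_z` are hypothetical, and the sequence-level log-likelihood surrogate is taken as a given number rather than computed from a diffusion model.

```python
import math
from dataclasses import dataclass


@dataclass
class Sample:
    """One sampled response for a single prompt (all fields are illustrative)."""
    logp_policy: float  # sequence-level log-likelihood surrogate, trained policy
    logp_ref: float     # the same surrogate under the frozen reference model
    reward: float       # scalar reward, e.g. 1.0 if the answer is correct


def tb_loss(samples, log_z, beta=1.0):
    """Mean squared trajectory-balance residual for one prompt.

    The residual is zero exactly when
    logp_policy == logp_ref + reward / beta - log_z,
    i.e. when the policy matches the reward-tilted reference.
    """
    residuals = [
        log_z + s.logp_policy - s.logp_ref - s.reward / beta
        for s in samples
    ]
    return sum(r * r for r in residuals) / len(residuals)


# Toy check: two responses, equally likely under the reference.
beta = 1.0
rewards = [1.0, 0.0]
logp_ref = [math.log(0.5), math.log(0.5)]
# Exact log-normalizer for this toy case; in the described method it is a
# learned, prompt-dependent quantity rather than a closed-form value.
log_z = math.log(sum(math.exp(lp + r / beta) for lp, r in zip(logp_ref, rewards)))

# A policy equal to the reference leaves a nonzero residual.
untrained = [Sample(lp, lp, r) for lp, r in zip(logp_ref, rewards)]
# A policy equal to the tilted target drives the loss to zero.
tilted = [Sample(lp + r / beta - log_z, lp, r) for lp, r in zip(logp_ref, rewards)]

loss_untrained = tb_loss(untrained, log_z, beta)
loss_tilted = tb_loss(tilted, log_z, beta)
print(loss_untrained, loss_tilted)
```

Because the loss is minimized by matching a distribution rather than by maximizing reward, lower-reward responses keep probability mass in proportion to the reference model. That is the property the summary credits with avoiding trajectory locking.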
- TraFL targets 'trajectory locking', in which reward-driven post-training concentrates probability on too few denoising paths in diffusion language models.
- Uses a trajectory-balance objective to train toward a reward-tilted distribution anchored to a frozen reference model, with a practical diffusion-compatible surrogate.
- Outperforms all baselines on Minerva Math and LiveCodeBench, improving over the base model in every benchmark-length setting with gains that persist at higher sampling budgets.
Why It Matters
Enables more reliable and diverse outputs from diffusion language models, crucial for high-stakes reasoning and code generation tasks.