Research & Papers

TMPO boosts diffusion model diversity by 9.1% with trajectory matching

New RL method fixes reward hacking while preserving generative diversity across tasks.

Deep Dive

Reinforcement learning (RL) is widely used to align diffusion models with specific downstream tasks, but it is prone to reward hacking: the model collapses onto a narrow set of high-reward outputs and loses diversity, which shows up as visual mode collapse. A new paper, "TMPO: Trajectory Matching Policy Optimization for Diverse and Efficient Diffusion Alignment," attributes this to the mode-seeking nature of existing methods, which maximize expected reward without constraining the probability distribution over acceptable trajectories. To address this, the authors propose TMPO, which replaces scalar reward maximization with trajectory-level reward distribution matching. TMPO introduces a Softmax Trajectory Balance (Softmax-TB) objective that aligns the policy probabilities of K trajectories with a reward-induced Boltzmann distribution. The authors prove that this objective inherits the mode-covering property of forward KL divergence, so that the model keeps exploring all acceptable trajectories while optimizing reward.
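To make the matching idea concrete, here is a minimal sketch in plain Python. It is not the paper's Softmax-TB objective, whose exact form is not given in this summary. It assumes the simplest reading of the description above: normalize the policy's log-probabilities over K sampled trajectories with a softmax, build a Boltzmann target from the rewards, and take the forward KL from target to policy. The function names and the temperature parameter `beta` are hypothetical.

```python
import math

def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]

def softmax_matching_loss(traj_log_probs, rewards, beta=1.0):
    """Forward KL(target || policy) over K sampled trajectories.

    traj_log_probs: policy log-probability of each trajectory.
    rewards: scalar reward of each trajectory.
    beta: assumed temperature of the Boltzmann target (hypothetical name).
    """
    target_log = log_softmax([r / beta for r in rewards])  # reward-induced target
    policy_log = log_softmax(traj_log_probs)               # policy, renormalized over K
    return sum(math.exp(t) * (t - p) for t, p in zip(target_log, policy_log))

# Two equally rewarded trajectories: a policy that covers both incurs
# (near) zero loss, while one that drops the second is penalized heavily.
covering = softmax_matching_loss([0.0, 0.0], [1.0, 1.0])
collapsed = softmax_matching_loss([0.0, -20.0], [1.0, 1.0])
print(covering, collapsed)
```

The second call illustrates the mode-covering behavior in this toy setting: under a forward KL, assigning almost no probability to a trajectory that the target still rewards is costly, whereas pure expected-reward maximization would be indifferent between the two policies.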

To make TMPO practical for large-scale flow-matching models, the method incorporates Dynamic Stochastic Tree Sampling. Multiple trajectories share denoising prefixes and branch at dynamically scheduled steps, which cuts redundant computation without sacrificing training effectiveness. In experiments across human preference alignment, compositional generation, and text rendering, the authors report that TMPO improves generative diversity by 9.1% over state-of-the-art methods while remaining competitive on all downstream and efficiency metrics, balancing reward against diversity. The paper is authored by Jiaming Li, Chenyu Zhu, and 10 others, and is available on arXiv (2605.10983).
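The saving from prefix sharing can be illustrated with a toy count of denoising-step evaluations. This is only a sketch under stated assumptions: the paper's actual branching schedule is not described in this summary, so `sample_branch_steps` is a hypothetical stand-in that picks branch steps uniformly at random, and every branch is assumed to split each active trajectory into `branch_factor` children.

```python
import random

def sample_branch_steps(num_steps, num_branches, rng):
    """Hypothetical schedule: choose distinct branch steps uniformly at random."""
    return set(rng.sample(range(num_steps), num_branches))

def count_denoise_calls(num_steps, branch_steps, branch_factor=2):
    """Compare denoising-step evaluations for a shared tree vs. independent rollouts.

    Returns (leaves, tree_calls, independent_calls).
    """
    active = 1       # trajectories currently being denoised
    tree_calls = 0
    for t in range(num_steps):
        if t in branch_steps:
            active *= branch_factor  # each trajectory splits before step t
        tree_calls += active         # one evaluation per active trajectory
    leaves = active
    independent_calls = leaves * num_steps  # every leaf denoised from scratch
    return leaves, tree_calls, independent_calls

# Fixed schedule: 10 steps, branching at steps 2, 5 and 8.
print(count_denoise_calls(10, {2, 5, 8}))  # (8, 36, 80)

# Randomly drawn schedule, seeded for reproducibility.
rng = random.Random(0)
print(count_denoise_calls(10, sample_branch_steps(10, 3, rng)))
```

In the fixed example, eight final trajectories cost 36 step evaluations instead of 80. Later branch points share longer prefixes and save more compute, but they leave the trajectories less time to diverge, which is the trade-off a branching schedule has to manage.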

Key Points
  • TMPO replaces scalar reward maximization with trajectory-level reward distribution matching using a Softmax Trajectory Balance objective.
  • The method improves generative diversity by 9.1% over state-of-the-art RL-based diffusion alignment methods.
  • Dynamic Stochastic Tree Sampling cuts redundant training computation by sharing denoising prefixes across trajectories and branching at dynamically scheduled steps.

Why It Matters

TMPO offers a practical fix for reward hacking in diffusion models, enabling high-quality, diverse outputs for real-world generative AI applications.