Research & Papers

DynamicPO: Dynamic Preference Optimization for Recommendation

More negatives used to hurt performance—DynamicPO makes them help instead.

Deep Dive

A team of researchers (Xingyu Hu et al., from multiple institutions) identified a counterintuitive problem in LLM-based recommendation systems using direct preference optimization (DPO): as the number of negative samples increases, recommendation performance actually drops, even though the training loss continues to decrease. They call this phenomenon preference optimization collapse. The root cause is gradient suppression—easy-to-discriminate negatives dominate the gradient, while boundary-critical negatives (those near the user's true preference boundary) are under-optimized.
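The paper's exact objective isn't reproduced here, but a toy sketch helps picture the gradient-suppression claim. The snippet below assumes a common multi-negative extension of DPO that sums a pairwise log-sigmoid term over every negative; all margins, the beta value, and the counts of "easy" versus "boundary-critical" negatives are hypothetical and chosen only for illustration.

    # Toy sketch (not the paper's formulation): summed pairwise
    # multi-negative DPO loss over synthetic reward margins.
    import torch
    import torch.nn.functional as F

    beta = 0.1  # global DPO temperature (hypothetical value)

    # Implicit reward margins r(positive) - r(negative_i):
    # large margin = easy-to-discriminate negative, small = boundary-critical.
    easy = torch.full((50,), 4.0, requires_grad=True)        # many easy negatives
    boundary = torch.tensor([0.2, -0.1], requires_grad=True)  # few boundary-critical

    # Loss: -sum_i log sigmoid(beta * margin_i)
    loss = -(F.logsigmoid(beta * easy).sum() + F.logsigmoid(beta * boundary).sum())
    loss.backward()

    print("gradient mass from easy negatives:    ", easy.grad.abs().sum().item())
    print("gradient mass from boundary negatives:", boundary.grad.abs().sum().item())

In this toy setting the many easy negatives collectively contribute far more gradient mass than the two boundary-critical ones, which is one way to picture the suppression the authors describe: the negatives that matter most for the preference boundary receive only a small share of the update.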

To fix this, they propose DynamicPO, a lightweight, plug-and-play framework with two adaptive mechanisms. The first, Dynamic Boundary Negative Selection, actively identifies and prioritizes negative samples that lie close to the model's decision boundary, i.e., the most informative ones. The second, Dual-Margin Dynamic beta Adjustment, calibrates the strength of the DPO objective for each sample based on how ambiguous the boundary is for that example. Experiments on three public datasets show that DynamicPO prevents optimization collapse and improves accuracy over existing multi-negative preference optimization methods, with negligible computational overhead. The code and datasets are publicly available.
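To make the two mechanisms concrete, here is a rough, hypothetical sketch of what they might look like over implicit DPO rewards. The selection rule (keep the negatives whose reward is nearest the positive's) and the beta schedule (shrink or grow beta with a tanh of the two margins) are assumptions for illustration only; the paper defines its own criteria.

    # Hypothetical sketch of DynamicPO's two mechanisms; the actual
    # selection rule and beta schedule come from the paper.
    import torch
    import torch.nn.functional as F

    def select_boundary_negatives(pos_reward, neg_rewards, k):
        # Dynamic Boundary Negative Selection (sketch): keep the k negatives
        # whose implicit reward is closest to the positive's, i.e. nearest
        # the model's current decision boundary and thus most informative.
        gaps = (pos_reward - neg_rewards).abs()
        idx = torch.topk(gaps, k=min(k, neg_rewards.numel()), largest=False).indices
        return neg_rewards[idx]

    def dynamic_beta(pos_reward, selected_negs, beta_min=0.05, beta_max=0.5):
        # Dual-Margin Dynamic beta Adjustment (sketch): calibrate beta per
        # sample from the margins to the hardest and easiest selected
        # negatives; more ambiguous boundaries get a larger beta here.
        hard_margin = (pos_reward - selected_negs.max()).clamp(min=0.0)
        easy_margin = (pos_reward - selected_negs.min()).clamp(min=0.0)
        ambiguity = 1.0 - torch.tanh(0.5 * (hard_margin + easy_margin))
        return beta_min + (beta_max - beta_min) * ambiguity

    # Toy usage with synthetic implicit rewards.
    pos = torch.tensor(1.0)
    negs = torch.tensor([-2.0, 0.8, 0.9, -3.5, 1.1])
    chosen = select_boundary_negatives(pos, negs, k=3)
    beta = dynamic_beta(pos, chosen)
    loss = -F.logsigmoid(beta * (pos - chosen)).mean()
    print(chosen, beta.item(), loss.item())

The intent captured by the sketch is the plug-and-play nature of the framework: both steps operate on quantities a DPO training loop already computes, which is consistent with the reported negligible overhead.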

Key Points
  • Preference optimization collapse: adding more negatives degrades performance despite lower training loss, due to gradient suppression from easy negatives.
  • DynamicPO introduces two mechanisms: Dynamic Boundary Negative Selection (picks informative negatives near the decision boundary) and Dual-Margin Dynamic beta Adjustment (calibrates optimization per sample).
  • Tested on three public datasets, DynamicPO improves recommendation accuracy with negligible overhead, and code is open-sourced.

Why It Matters

Smarter negative sampling in AI recommenders can boost accuracy without extra compute, fixing a hidden flaw in DPO training.