ξ-DPO's ratio reward margin cuts hyperparameter tuning in half
New method ξ-DPO replaces SimPO's two jointly tuned hyperparameters with a single interpretable margin derived from the data.
Preference optimization is critical for aligning large language models, but reference-free methods like SimPO (Simple Preference Optimization) share a central difficulty: the reward scale β and the target margin γ must be tuned jointly. The authors of ξ-DPO analyze SimPO and find that β implicitly controls sample filtering, while γ's effect depends on the reward gap structure of the dataset. As a result, a good setting for one dataset does not transfer reliably to another, which makes trial-and-error tuning hard to avoid in practice and limits wider adoption.
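To make the coupling between β and γ concrete, here is a minimal sketch of the SimPO loss for a single preference pair, following the published SimPO objective. The function and argument names are illustrative choices rather than anything from the ξ-DPO paper, and the inputs are assumed to be length-normalized log-probabilities that have already been computed.

```python
import math

def simpo_loss(avg_logp_chosen, avg_logp_rejected, beta, gamma):
    """SimPO loss for one preference pair (illustrative sketch).

    avg_logp_* are assumed to be the policy's log-probability of each
    response divided by its length in tokens.
    """
    # Reward gap: beta scales the difference of length-normalized log-probs.
    reward_gap = beta * (avg_logp_chosen - avg_logp_rejected)
    # The loss is -log(sigmoid(reward_gap - gamma)).
    z = reward_gap - gamma
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Same pair, two different (beta, gamma) settings.
loss_a = simpo_loss(-1.0, -2.0, beta=2.0, gamma=0.5)
loss_b = simpo_loss(-1.0, -2.0, beta=2.0, gamma=2.0)
```

Because the loss depends on `beta * gap - gamma`, the log-probability gap is effectively compared against `gamma / beta`, so changing one hyperparameter shifts the meaning of the other.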
To solve this, ξ-DPO first reformulates the preference objective via an equivalent transformation: instead of maximizing the likelihood of the reward gap, it minimizes the distance between reward gaps and their optimal margins. It then redefines the reward as a ratio between chosen and rejected responses. Because β scales both terms of the ratio, its effect cancels, and what remains is a bounded, interpretable margin called ξ. Unlike γ, ξ explicitly represents the desired relative separation between chosen and rejected responses and can be derived from the dataset's initial reward distribution. The result is a hyperparameter-light method that retains SimPO's performance without the tuning overhead.
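The cancellation of β can be checked with simple arithmetic. The sketch below is not the ξ-DPO objective: the exact ratio form is not given here, so `reward_ratio` is a hypothetical stand-in that only assumes each reward is β times a length-normalized log-probability. It contrasts a difference-based gap, which scales with β, with a ratio, which does not.

```python
import math

def reward_difference(avg_logp_chosen, avg_logp_rejected, beta):
    # Difference of beta-scaled rewards: grows linearly with beta.
    return beta * avg_logp_chosen - beta * avg_logp_rejected

def reward_ratio(avg_logp_chosen, avg_logp_rejected, beta):
    # Hypothetical ratio of beta-scaled rewards (an assumption for
    # illustration, not the paper's definition): beta appears in both
    # numerator and denominator, so it cancels.
    return (beta * avg_logp_chosen) / (beta * avg_logp_rejected)

chosen, rejected = -1.0, -2.0

# The difference changes when beta changes.
diff_small_beta = reward_difference(chosen, rejected, beta=1.0)
diff_large_beta = reward_difference(chosen, rejected, beta=2.5)

# The ratio is the same for both values of beta.
ratio_small_beta = reward_ratio(chosen, rejected, beta=1.0)
ratio_large_beta = reward_ratio(chosen, rejected, beta=2.5)
```

Under this assumption, a margin defined on the ratio scale keeps the same meaning whatever β is, which is the property the paragraph above attributes to ξ.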
- ξ-DPO removes the joint tuning of β and γ by replacing them with a single interpretable ratio reward margin ξ.
- Analysis reveals that β implicitly controls sample filtering and that γ's effect depends on the dataset's reward gap structure, which is what makes tuning difficult.
- The new margin ξ is bounded, data-driven, and directly set from the initial reward gap distribution.
Why It Matters
Makes reference-free preference optimization practical for aligning LLMs at scale.