Research & Papers

Learning from Noisy Preferences: A Semi-Supervised Learning Approach to Direct Preference Optimization

Researchers tackle noisy human preference labels with semi-supervised learning, boosting diffusion models' alignment with human visual preferences without extra annotations.

Deep Dive

A team of researchers led by Xinxin Liu and Ming Li introduced Semi-DPO, a semi-supervised learning method designed to improve the alignment of diffusion models with complex human visual preferences. The core problem they address is that existing preference datasets collapse multi-dimensional human judgments, covering aesthetics, detail fidelity, and semantic alignment, into a single binary winner/loser label. This compression creates severe label noise: an image that excels in some dimensions but fails in others is still marked simply as a winner or a loser. The authors demonstrate theoretically that this noise generates conflicting gradient signals that misguide standard Diffusion Direct Preference Optimization (DPO) and degrade model performance.
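For context, Diffusion-DPO builds on the standard DPO objective, shown here in its original non-diffusion form (the diffusion variant estimates the log-likelihoods over denoising trajectories). Here c is the prompt, x_w and x_l the labeled winner and loser images, pi_ref a frozen reference model, and beta a temperature; when the binary label is wrong along some quality dimension, the gradient pushes this log-ratio margin in the wrong direction for that dimension.

    \mathcal{L}_{\mathrm{DPO}}(\theta)
      = -\,\mathbb{E}_{(c,\,x_w,\,x_l)}\left[
          \log \sigma\!\left(
            \beta \log \frac{\pi_\theta(x_w \mid c)}{\pi_{\mathrm{ref}}(x_w \mid c)}
            - \beta \log \frac{\pi_\theta(x_l \mid c)}{\pi_{\mathrm{ref}}(x_l \mid c)}
          \right)
        \right]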

To overcome this, Semi-DPO treats consistent preference pairs (where multiple dimensions agree) as clean labeled data, and conflicting pairs as noisy unlabeled data. The method first trains a model on a consensus-filtered clean subset. This model then serves as an implicit classifier to generate pseudo-labels for the noisy set, allowing iterative refinement. Experimental results show that Semi-DPO achieves state-of-the-art performance in aligning diffusion models with complex human preferences, all without requiring additional human annotations or explicit reward models during training. The team plans to release their code and models publicly.
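The summary above leaves the exact filtering and pseudo-labeling rules unspecified, so the following Python sketch is only illustrative. It assumes each preference pair carries per-dimension winner votes in a dims field and precomputed log-likelihoods under the current and reference models; the field names, the confidence threshold tau, and the margin-based flipping rule are assumptions for illustration, not the authors' code.

    # Illustrative sketch of a Semi-DPO-style data split and one
    # pseudo-labeling round. Field names ("dims", "logp_w", ...), the
    # threshold `tau`, and the flipping rule are assumptions, not the
    # paper's actual API.

    def consensus_split(pairs):
        """Separate pairs whose per-dimension votes all agree (clean)
        from pairs with conflicting votes (noisy)."""
        clean, noisy = [], []
        for p in pairs:
            votes = set(p["dims"].values())  # e.g. {+1} or {+1, -1}
            (clean if len(votes) == 1 else noisy).append(p)
        return clean, noisy

    def implicit_margin(p, beta=0.1):
        """DPO implicit-reward margin:
        beta * (log-ratio(winner) - log-ratio(loser)).
        Log-likelihoods are assumed precomputed; in Diffusion-DPO they
        would be estimated over denoising steps."""
        return beta * ((p["logp_w"] - p["ref_logp_w"])
                       - (p["logp_l"] - p["ref_logp_l"]))

    def pseudo_label(noisy, tau=0.5, beta=0.1):
        """Use the clean-trained model as an implicit classifier: keep
        noisy pairs it is confident about, flipping winner/loser when
        the margin is confidently negative; drop the rest this round."""
        relabeled = []
        for p in noisy:
            m = implicit_margin(p, beta)
            if m > tau:          # model agrees with the stored label
                relabeled.append(p)
            elif m < -tau:       # model confidently disagrees: flip
                q = dict(p)
                q["logp_w"], q["logp_l"] = p["logp_l"], p["logp_w"]
                q["ref_logp_w"], q["ref_logp_l"] = (p["ref_logp_l"],
                                                    p["ref_logp_w"])
                relabeled.append(q)
        return relabeled

In a full Semi-DPO round, this pseudo-labeling step would alternate with retraining Diffusion-DPO on the clean pairs plus the relabeled ones; the stopping rule and any threshold schedule are not specified in the summary.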

Key Points
  • Semi-DPO treats consistent preference pairs as clean labeled data and conflicting pairs as noisy unlabeled data.
  • The method trains on a consensus-filtered clean subset, then uses that model to generate pseudo-labels for the noisy pairs and refines iteratively.
  • Achieves state-of-the-art alignment with complex human preferences without extra human annotation or explicit reward models.

Why It Matters

Semi-DPO reduces reliance on expensive human annotation, making high-quality image alignment more scalable for real-world diffusion model applications.