Anchor-guided reward model solves non-identifiability for pluralistic preferences
Two anchor labels unlock Gaussian reward modeling for diverse human values.
Standard Bradley-Terry (BT) reward models struggle when human preferences are pluralistic, that is, when different people genuinely disagree. Soft preference labels preserve that disagreement, but a BT model can express it only by shrinking the reward margin between two responses, so a contested pair ends up looking like a near-tie. Gaussian reward models, which predict both a mean and a variance for each response, are a natural alternative. Trained on pairwise comparisons alone, however, they are fundamentally non-identifiable: different mean and variance settings can produce exactly the same preference probabilities.
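The non-identifiability can be seen in a toy calculation. The sketch below is not taken from the paper: it assumes each response has an independent Gaussian reward, so that the probability of preferring a over b is Φ((μ_a − μ_b) / sqrt(σ_a² + σ_b²)). The helper names and the numbers are hypothetical. Multiplying every mean and standard deviation by the same factor leaves the preference probability unchanged, so pairwise data cannot pin down the scale of the variance.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pref_prob(mu_a: float, sigma_a: float, mu_b: float, sigma_b: float) -> float:
    """P(a preferred over b), ASSUMING independent Gaussian rewards."""
    return normal_cdf((mu_a - mu_b) / math.sqrt(sigma_a**2 + sigma_b**2))

# Hypothetical (mean, std) reward parameters for two responses.
a = (1.0, 0.5)
b = (0.2, 1.5)

p_original = pref_prob(*a, *b)

# Rescale every mean and every std by the same factor c.
c = 3.0
p_rescaled = pref_prob(c * a[0], c * a[1], c * b[0], c * b[1])

# The two probabilities agree up to floating-point error.
print(p_original, p_rescaled)
```

Under this assumed model, any dataset of pairwise preferences is explained equally well by the original and the rescaled parameters, which is one concrete form of the ambiguity described above.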
Fang et al. propose a fix: augment the preference data with two coarse response-level anchor labels (e.g., "this response is good" or "bad"). They prove that two anchors are sufficient to break the non-identifiability, and they develop a joint training objective with a proven non-asymptotic convergence rate for both the mean and the variance. Across simulations and four real-world datasets with diverging preferences, the method consistently improves reward modeling accuracy and downstream RLHF performance, including PPO training and best-of-N selection. The work offers a practical way to align AI with diverse, pluralistic human values.
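To build intuition for why anchors can help, here is a toy illustration. It is an assumption made for this example, not the paper's formulation: a response is labeled "good" when its latent Gaussian reward exceeds a fixed threshold `TAU_GOOD`, and "bad" when it falls below a fixed threshold `TAU_BAD`. All names and values are hypothetical. The sketch constructs a reparameterization that preserves both the pairwise preferences and the "good" anchor probability, and shows that the "bad" anchor probability still changes, so the second anchor rules it out.

```python
import math

def normal_cdf(x: float) -> float:
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def pref_prob(a, b):
    """P(a preferred over b) for (mean, std) pairs, ASSUMING independent Gaussian rewards."""
    return normal_cdf((a[0] - b[0]) / math.hypot(a[1], b[1]))

def good_prob(r, tau_good):
    """Assumed anchor model: 'good' means the latent reward exceeds tau_good."""
    return normal_cdf((r[0] - tau_good) / r[1])

def bad_prob(r, tau_bad):
    """Assumed anchor model: 'bad' means the latent reward falls below tau_bad."""
    return normal_cdf((tau_bad - r[0]) / r[1])

TAU_GOOD, TAU_BAD = 1.0, -1.0  # hypothetical fixed thresholds

def transform(r, c):
    """Scale the std by c and move the mean so (mean - TAU_GOOD) / std is unchanged."""
    mu, sigma = r
    return (TAU_GOOD + c * (mu - TAU_GOOD), c * sigma)

# Hypothetical (mean, std) reward parameters and their transformed versions.
a, b = (1.5, 0.5), (0.2, 1.5)
a2, b2 = transform(a, 2.0), transform(b, 2.0)

pairwise_gap = abs(pref_prob(a, b) - pref_prob(a2, b2))            # stays ~0
good_gap = abs(good_prob(b, TAU_GOOD) - good_prob(b2, TAU_GOOD))   # stays ~0
bad_gap = abs(bad_prob(b, TAU_BAD) - bad_prob(b2, TAU_BAD))        # clearly nonzero
print(pairwise_gap, good_gap, bad_gap)
```

In this toy setup, pairwise data plus a single anchor still admit a family of equivalent parameters, while two anchors with distinct thresholds do not. The paper's actual anchor model and proof may differ from this construction.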
- Gaussian reward models for pluralistic preferences are non-identifiable when trained on pairwise comparisons alone; adding just two anchor labels resolves this.
- The framework provides a non-asymptotic convergence rate for both reward mean and variance estimation.
- The method outperforms standard BT models on four real-world diverging-preference datasets, both in reward modeling and in downstream RLHF tasks (PPO, best-of-N).
Why It Matters
The method makes reward models more accurate on pluralistic preference datasets, which in turn improves RLHF alignment with diverse human values.