New paper reveals optimal N for Best-of-N preference learning
How to choose N for reward learning? A new theory has the answer.
Researchers affiliated with CMU (Pukdee, Balcan, Ravikumar) have published a theoretical analysis of reward learning from Best-of-N preference data, a widely used method in RLHF and alignment. The paper, appearing on arXiv in May 2026, specializes a recent conditional-distribution framework to understand what Bradley-Terry (BT) models actually learn from such data. For the common Best-vs-Random and Best-vs-Worst variants, where the chosen and rejected responses are coupled because they come from the same candidate set, the authors prove that exact BT representability generally fails, but that minimizers over bounded classes converge to the reference targets as N increases.
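For reference, the standard Bradley-Terry model assigns each response a scalar reward and scores a comparison through the reward difference. The notation below (r, σ, y, y') is assumed here for illustration and may differ from the paper's:

```latex
P(y \succ y') = \sigma\bigl(r(y) - r(y')\bigr)
             = \frac{\exp r(y)}{\exp r(y) + \exp r(y')},
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
```

Read informally, exact BT representability asks whether a single reward function r reproduces every pairwise choice probability in the data in this form.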
The key practical insight is a fundamental tradeoff: larger N widens the pairwise margins between the chosen and rejected responses, making preference signals stronger, but it also reduces connectivity in the comparison graph, which hurts sample efficiency. The authors provide explicit design principles: use larger N when collecting preference labels is the costly bottleneck (e.g., human annotations), and smaller N when generating candidates is the bottleneck. They also show that shaping the base distribution to concentrate mass on the most important response pairs can improve alignment. Experiments on synthetic and real data validate the predicted dependence on sample size and distribution shape, giving practitioners clear guidelines for tuning N in reward modeling pipelines.
- For coupled Best-vs-Random and Best-vs-Worst data, exact BT representability generally fails, but bounded minimizers approach the reference targets as N grows.
- Larger N widens pairwise margins but reduces graph connectivity, creating a tradeoff that depends on whether labeling or generation is the bottleneck.
- Design principles: use larger N when labels are costly, smaller N when generation is costly; shape the base distribution to focus comparisons on test-relevant response pairs.
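The margin-versus-connectivity tradeoff can be illustrated with a small simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: it assumes 20 responses with known rewards 0 to 19, a uniform base distribution, and a noiseless labeler who always picks the highest-reward candidate. The function names `best_vs_random_pair` and `summarize` are hypothetical, and the count of distinct compared pairs is only a crude proxy for comparison-graph connectivity.

```python
import random

def best_vs_random_pair(rewards, n, rng):
    """Sample n candidates uniformly; return (chosen, rejected) response indices.

    chosen   = the highest-reward candidate (noiseless labeler, an assumption)
    rejected = a uniformly random one of the remaining n - 1 candidates
    """
    candidates = [rng.randrange(len(rewards)) for _ in range(n)]
    chosen_pos = max(range(n), key=lambda i: rewards[candidates[i]])
    others = [c for i, c in enumerate(candidates) if i != chosen_pos]
    return candidates[chosen_pos], rng.choice(others)

def summarize(rewards, n, num_pairs, seed=0):
    """Return (mean reward margin, number of distinct compared pairs)."""
    rng = random.Random(seed)
    total_margin = 0.0
    edges = set()
    for _ in range(num_pairs):
        chosen, rejected = best_vs_random_pair(rewards, n, rng)
        total_margin += rewards[chosen] - rewards[rejected]
        if chosen != rejected:  # duplicate draws carry no comparison
            edges.add(frozenset((chosen, rejected)))
    return total_margin / num_pairs, len(edges)

rewards = list(range(20))  # hypothetical ground-truth rewards
for n in (2, 4, 16):
    margin, num_edges = summarize(rewards, n, num_pairs=2000)
    print(f"N={n:2d}  mean margin={margin:.2f}  distinct pairs={num_edges}")
```

In this toy setting, the mean margin should grow with N while the number of distinct compared pairs shrinks, because the chosen response concentrates on the few top-reward items and the comparison graph becomes closer to a star around them.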
Why It Matters
The work offers actionable theory for tuning N in reward learning, with direct implications for RLHF efficiency and alignment quality.