New RLHF Theory Shows Preference-Only Learning Converges to Optimal Policy
Researchers prove that RL from binary preferences alone achieves regret that grows sublinearly in the number of episodes in kernel MDPs.
A new paper from researchers Nikola Pavlovic, Sattar Vakili, and Qing Zhao tackles a foundational question in reinforcement learning from human feedback (RLHF): can an agent learn an optimal policy using only binary preference comparisons, without access to numeric rewards? The work, posted on arXiv, studies episodic kernel Markov decision processes (MDPs), one of the most expressive models that still allows theoretical analysis. In each episode, the learner deploys two policies from the same start state and receives a single binary label indicating which trajectory is preferred. This feedback is modeled via a Bradley-Terry-Luce model on the difference of cumulative (unobserved) rewards.
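To make the feedback model concrete, here is a minimal sketch of how a single Bradley-Terry-Luce preference label could be simulated from two trajectories. It is an illustration under stated assumptions, not the paper's code: the function names, the logistic link with unit scale, and the example reward sequences are all hypothetical, and the learner in the paper never observes the per-step rewards used here.

```python
import math
import random


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def preference_probability(rewards_a, rewards_b) -> float:
    """P(trajectory A is preferred over B) under a Bradley-Terry-Luce model.

    The probability depends only on the difference of cumulative rewards,
    which the learner never sees directly (assumption: unit-scale logistic link).
    """
    return sigmoid(sum(rewards_a) - sum(rewards_b))


def sample_preference(rewards_a, rewards_b, rng: random.Random) -> int:
    """Return 1 if A is preferred, 0 otherwise: one binary label per episode."""
    return 1 if rng.random() < preference_probability(rewards_a, rewards_b) else 0


# Hypothetical per-step rewards for two trajectories from the same start state.
traj_a = [0.2, 0.5, 0.9]
traj_b = [0.1, 0.4, 0.3]
rng = random.Random(0)
label = sample_preference(traj_a, traj_b, rng)
```

The only signal that would reach the learner in this setup is `label`; everything else in the sketch stands in for the hidden environment and the human annotator.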
The authors develop novel preference-based value estimation methods and confidence sets that exploit the kernel structure. They prove high-probability regret bounds that scale sublinearly in the number of episodes, meaning the average per-episode gap between the deployed policies and the optimal policy shrinks toward zero as more data is collected. Importantly, the analysis handles end-of-episode feedback, a realistic setting in which a human gives a preference only after seeing the entire trajectory. This provides rigorous theoretical grounding for preference-only RLHF, moving beyond simple tabular or linear models to more general function approximation with kernels.
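One common way to formalize "sublinear regret" when two policies are deployed per episode is sketched below. The notation is an assumption for illustration and may differ from the paper's exact definition: $T$ is the number of episodes, $s_1$ the shared start state, $\pi^\star$ the optimal policy, and $\pi_t^{1}, \pi_t^{2}$ the two policies deployed in episode $t$.

```latex
\[
\mathrm{Regret}(T)
  = \sum_{t=1}^{T}
    \Bigl( V^{\pi^\star}(s_1)
      - \tfrac{1}{2}\bigl( V^{\pi_t^{1}}(s_1) + V^{\pi_t^{2}}(s_1) \bigr) \Bigr),
\qquad
\mathrm{Regret}(T) = o(T)
  \;\Longrightarrow\;
  \frac{\mathrm{Regret}(T)}{T} \to 0 .
\]
```

Under this reading, a sublinear bound says the average shortfall per episode vanishes as $T$ grows. The specific rate depends on the kernel and is not stated in this summary.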
- Single binary preference label per episode comparing two trajectories, modeled with Bradley-Terry-Luce on cumulative reward differences.
- New preference-based value estimation and confidence sets designed for kernel MDPs with end-of-episode feedback.
- High-probability regret bounds that scale sublinearly in the number of episodes, so performance converges to that of the optimal policy under the paper's kernel assumptions.
Why It Matters
The paper provides rigorous theoretical foundations for preference-only RLHF, which could inform safer and more sample-efficient alignment of AI systems.