RankQ beats prior RL methods by 42.7% in low-data vision-based robot learning
Self-supervised action ranking boosts robot success from 43% to 85% in real-world tests.
Offline-to-online reinforcement learning (RL), in which an agent first learns from a pre-collected dataset and then continues training through online interaction, faces a fundamental tension: how to avoid overestimating the value of out-of-distribution actions without blindly cloning potentially suboptimal dataset behaviors. Prior methods impose pessimism by uniformly down-weighting unseen actions, but this can anchor the policy to mediocre behaviors. Researchers Andrew Choi and Wei Xu propose RankQ, a Q-learning objective that replaces uniform down-weighting with a self-supervised multi-term ranking loss. By learning relative action preferences, that is, which actions are better than others, RankQ shapes the Q-function so that its action gradients steer the policy toward higher-quality behaviors rather than simply mimicking the dataset.
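To make the idea of ranking actions by their Q-values concrete, here is a minimal sketch of a generic pairwise margin (hinge) ranking loss. It is not RankQ's actual objective: the summary above does not spell out the terms of the multi-term loss or how preferred actions are chosen, so the function name `pairwise_ranking_loss`, the `margin` parameter, and the example Q-values are all illustrative assumptions.

```python
from typing import Sequence


def pairwise_ranking_loss(q_preferred: Sequence[float],
                          q_other: Sequence[float],
                          margin: float = 1.0) -> float:
    """Mean hinge loss over action pairs (hypothetical helper, not RankQ's loss).

    Each pair contributes zero once the preferred action's Q-value exceeds
    the other action's Q-value by at least `margin`; otherwise the shortfall
    is penalized linearly.
    """
    if len(q_preferred) != len(q_other):
        raise ValueError("inputs must have the same length")
    if not q_preferred:
        raise ValueError("inputs must be non-empty")
    losses = [max(0.0, margin - (qp - qo))
              for qp, qo in zip(q_preferred, q_other)]
    return sum(losses) / len(losses)


# Made-up Q-values for three (state, preferred action, other action) triples.
q_pref = [2.0, 0.5, 1.0]
q_oth = [0.5, 0.7, 1.0]
loss = pairwise_ranking_loss(q_pref, q_oth, margin=1.0)
print(f"ranking loss: {loss:.3f}")
```

In this sketch the first pair is already ordered with enough margin and adds nothing, while the second (wrongly ordered) and third (tied) pairs are penalized. A loss of this kind constrains only the relative order of actions; it does not uniformly push down the values of every unseen action.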
The method is evaluated in two settings. On standard sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. The real impact, however, comes in vision-based robot learning: when fine-tuning a pretrained vision-language-action (VLA) model in a low-data regime, RankQ delivers a 42.7% higher simulation success rate than the next best method. In a high-data setting, it still outperforms the next best method by 13.7%. Most impressively, RankQ achieves strong sim-to-real transfer, raising real-world cube stacking success from the VLA's initial 43.1% to 84.7%. The work suggests that structured action ordering can replace explicit pessimism for more effective and sample-efficient RL fine-tuning.
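A success-rate improvement can be read either as an absolute gain in percentage points or as a relative gain over the baseline. The short sketch below works through both readings for the real-world cube stacking figures (43.1% and 84.7%). The helper names are illustrative, and the summary does not say which reading applies to the 42.7% and 13.7% simulation figures, so they are left out.

```python
def absolute_gain(before_pct: float, after_pct: float) -> float:
    """Difference between two success rates, in percentage points."""
    return after_pct - before_pct


def relative_gain(before_pct: float, after_pct: float) -> float:
    """Improvement expressed as a percentage of the starting success rate."""
    return (after_pct - before_pct) / before_pct * 100.0


# Real-world cube stacking success rates quoted in the text.
before, after = 43.1, 84.7

print(f"absolute: {absolute_gain(before, after):.1f} percentage points")  # absolute: 41.6 percentage points
print(f"relative: {relative_gain(before, after):.1f}% over the baseline")  # relative: 96.5% over the baseline
```

Read either way, the real-world success rate roughly doubles relative to the VLA's starting point.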
- RankQ introduces a self-supervised multi-term ranking loss to enforce structured action ordering, replacing blanket pessimism used in prior offline-to-online RL methods.
- On D4RL benchmarks, RankQ matches or outperforms seven prior methods, including conservative approaches like CQL.
- In vision-based robot fine-tuning, RankQ boosts real-world cube stacking success from 43.1% to 84.7% and achieves 42.7% higher simulation success over the next best method in low-data settings.
Why It Matters
RankQ makes fine-tuning pretrained robot models dramatically more sample-efficient, enabling real-world skill transfer with fewer demonstrations.