Vector Policy Optimization trains LLMs for diverse test-time search
VPO replaces GRPO's scalar-reward advantage estimator with one built on vector rewards, training for solution diversity
Deep Dive
Vector Policy Optimization (VPO) is an RL algorithm that trains LLMs to produce diverse solutions in anticipation of multiple reward functions. Unlike standard post-training on a single scalar reward, VPO exploits vector rewards (e.g., per-test-case correctness) and is a drop-in replacement for GRPO's advantage estimator. Across four tasks, VPO matches or beats scalar RL baselines on pass@k and best@k, and it unlocks evolutionary search problems that standard models cannot solve.
Key Points
- VPO replaces GRPO's advantage estimator with a vector-reward approach that trains LLMs for output diversity
- Matches or beats scalar RL baselines on pass@k and best@k across four tasks, with the margin widening as the search budget grows
- Unlocks evolutionary search problems where standard GRPO models fail completely
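For readers unfamiliar with the two metrics named above, the following sketch shows how they are commonly computed. The pass@k function is the standard unbiased estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. The best@k function is a simple assumed definition (the highest reward among the first k samples); the summary does not state the exact evaluation protocol used.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples, drawn without replacement from n, is among the c correct."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is then 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(rewards, k):
    """Assumed best@k: the highest reward among the first k samples."""
    if not 1 <= k <= len(rewards):
        raise ValueError("require 1 <= k <= len(rewards)")
    return max(rewards[:k])

# Hypothetical example: 10 samples, 3 of them correct.
p1 = pass_at_k(10, 3, 1)
p5 = pass_at_k(10, 3, 5)

# Hypothetical per-sample rewards for one prompt.
rewards = [0.2, 0.5, 0.1, 0.9]
b2 = best_at_k(rewards, 2)
b4 = best_at_k(rewards, 4)
```

Both metrics can only rise or stay flat as k grows, and they rise faster when samples differ from one another, which is why a diversity-oriented training objective would be expected to show up most clearly at larger search budgets.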
Why It Matters
VPO could redefine LLM post-training to prioritize diversity, enabling robust inference-time search for complex, multi-objective tasks.