Research & Papers

Vector Policy Optimization trains LLMs for diverse test-time search

VPO replaces GRPO's scalar reward with diversity-focused vector rewards

Deep Dive

Vector Policy Optimization (VPO) is an RL algorithm that trains LLMs to produce a diverse set of solutions that anticipates multiple possible reward functions. Where standard post-training optimizes a single scalar reward, VPO exploits vector rewards (e.g., per-test-case correctness) and works as a drop-in replacement for GRPO's advantage estimator. Across four tasks, VPO matches or beats scalar RL baselines on pass@k and best@k, and it unlocks evolutionary search problems that standard models cannot solve.
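
To make the contrast concrete, here is a minimal sketch of the scalar baseline that VPO replaces: GRPO's group-relative advantage, which normalizes each sample's reward against the other samples drawn for the same prompt. The VPO estimator itself is not described in this summary, so it is not shown. The per-test-case results, the mean-based scalarization, and the use of population standard deviation are illustrative assumptions, not details from the paper.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical per-test-case results for 4 sampled solutions (1 = test passed).
vector_rewards = [
    [1, 1, 0, 0],  # passes tests 0 and 1
    [0, 0, 1, 1],  # passes tests 2 and 3, complementary to the first sample
    [1, 1, 0, 0],  # duplicate of the first sample's behavior
    [0, 0, 0, 0],  # passes nothing
]

# Scalar post-training collapses each vector to one number (here: pass rate).
scalar_rewards = [sum(v) / len(v) for v in vector_rewards]
advantages = grpo_advantages(scalar_rewards)

# Tests covered by at least one sample in the group.
covered = [max(col) for col in zip(*vector_rewards)]
```

In this toy group the first three samples receive identical advantages, even though the second is the only one that covers tests 2 and 3. That lost information is what a vector reward keeps available to the learner.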

Key Points
  • VPO swaps GRPO's scalar advantage estimator for a vector-reward one, training the LLM to produce diverse outputs
  • Matches or beats scalar RL baselines on pass@k and best@k across four tasks, with the gap widening as the search budget grows
  • Unlocks evolutionary search problems where standard GRPO models fail completely
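
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed. pass_at_k is the standard unbiased estimator used in code-generation evaluation (the probability that at least one of k samples drawn from n is correct, given c correct samples). best_at_k here is a simple assumed definition, the highest score among the first k samples; the paper's exact definition is not given in this summary.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k includes a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def best_at_k(scores, k):
    """Assumed definition: best score among the first k samples."""
    return max(scores[:k])

# Hypothetical example: 10 samples, 3 of them correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
```

Both metrics reward a model whose samples differ from one another, since k near-identical samples add little over one. That is consistent with a diversity-trained model gaining more as the search budget grows.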

Why It Matters

VPO could redefine LLM post-training to prioritize diversity, enabling robust inference-time search for complex, multi-objective tasks.