Vector Policy Optimization trains LLMs for diverse test-time search

arXiv cs.NE May 22, 2026

⚡VPO replaces GRPO's scalar reward with diversity-focused vector rewards

Deep Dive

Vector Policy Optimization (VPO) is an RL algorithm that trains LLMs to produce diverse solutions in anticipation of multiple possible reward functions. Unlike standard post-training on a scalar reward, VPO exploits vector rewards (e.g., per-test-case correctness) and serves as a drop-in replacement for GRPO's advantage estimator. Across four tasks, VPO matches or beats scalar RL baselines on pass@k and best@k, and it unlocks evolutionary search problems that standard GRPO-trained models cannot solve.
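To make the contrast concrete, here is a minimal Python sketch of the scalar baseline only, not of VPO itself (the summary above does not describe VPO's estimator). It assumes the commonly described GRPO group-relative advantage, (r_i - mean) / std, and uses a made-up group of four responses scored on three test cases. The names `grpo_advantages` and `vector_rewards` are hypothetical.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as commonly described for GRPO:
    each reward is centered by the group mean and scaled by the group std.
    (Population std is an assumption; implementations differ.)"""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical group of 4 sampled responses, each scored on 3 test cases
# (1 = test passed, 0 = test failed). Illustrative data, not from the paper.
vector_rewards = [
    [1, 0, 0],  # passes only test 0
    [0, 1, 0],  # passes only test 1
    [0, 0, 1],  # passes only test 2
    [0, 0, 0],  # passes nothing
]

# Scalar post-training collapses each vector to one number (here: pass rate).
scalar_rewards = [mean(v) for v in vector_rewards]
advantages = grpo_advantages(scalar_rewards)

# Which tests does the group cover collectively?
covered = [any(column) for column in zip(*vector_rewards)]

print(scalar_rewards)
print(advantages)
print(covered)
```

In this toy group the first three responses receive identical scalar rewards and therefore identical advantages, even though each passes a different test and together they cover all three. That per-test information is what a vector reward keeps and a scalar reward discards.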

Key Points

VPO replaces GRPO's advantage estimator with a vector-reward approach that trains LLMs for output diversity
Matches or beats scalar RL baselines on pass@k and best@k across four tasks, with the gap widening as the search budget grows
Unlocks evolutionary search problems where standard GRPO-trained models fail completely
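For readers unfamiliar with pass@k, the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn and c of them are correct. Whether the paper computes pass@k exactly this way is an assumption, and best@k is not covered here. The function name `pass_at_k` is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n drawn samples were correct.
    Uses the standard unbiased estimator 1 - C(n - c, k) / C(n, k)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate becomes 1.0
    # whenever every size-k subset must contain a correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 samples drawn, 1 correct.
print(pass_at_k(10, 1, 1))
print(pass_at_k(10, 1, 5))
print(pass_at_k(10, 1, 10))
```

With a fixed number of correct samples, the estimate rises as k grows, which is why a model that spreads its samples over different solutions can benefit more from a larger search budget.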

Why It Matters

VPO could redefine LLM post-training to prioritize diversity, enabling robust inference-time search for complex, multi-objective tasks.
