VeGAS boosts embodied AI agents by up to 36% with verifier-guided action selection
Think twice: a new method checks candidate actions before acting, improving performance by up to 36% (relative) over chain-of-thought baselines.
A team of researchers (Nishad Singhi et al.) from multiple institutions introduced VeGAS (Verifier-Guided Action Selection) at CVPR 2026 to address the fragility of multimodal large language model (MLLM)-based embodied agents in out-of-distribution scenarios. Traditional chain-of-thought (CoT) reasoning often fails when the agent meets unexpected obstacles. VeGAS operates at inference time: it samples multiple candidate actions from the model, then uses a dedicated generative verifier to select the most reliable one, all without modifying the underlying policy. This 'think twice, act once' approach is meant to make the agent more robust when conditions differ from those it was trained on.
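The sample-then-verify loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `select_action`, the `Policy` and `Verifier` type aliases, the `num_candidates` parameter, and the toy scores are all hypothetical names and values chosen for this example, and the real verifier is a trained generative model rather than a lookup table.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases for this sketch; they are not the paper's API.
Policy = Callable[[str], str]            # observation -> one sampled action
Verifier = Callable[[str, str], float]   # (observation, action) -> reliability score


def select_action(
    observation: str,
    sample_action: Policy,
    verify: Verifier,
    num_candidates: int = 4,
) -> Tuple[str, List[Tuple[str, float]]]:
    """Sample several candidate actions and return the one the verifier scores highest."""
    if num_candidates < 1:
        raise ValueError("num_candidates must be at least 1")
    # Draw candidates from the unmodified base policy.
    candidates = [sample_action(observation) for _ in range(num_candidates)]
    # Deduplicate while preserving order so each distinct action is verified once.
    unique = list(dict.fromkeys(candidates))
    scored = [(action, verify(observation, action)) for action in unique]
    best_action, _ = max(scored, key=lambda pair: pair[1])
    return best_action, scored


# Toy stand-in for a trained verifier (assumed scores, for illustration only).
TOY_SCORES = {"open fridge": 0.9, "pick up mug": 0.4, "walk forward": 0.1}


def toy_verifier(observation: str, action: str) -> float:
    return TOY_SCORES.get(action, 0.0)


if __name__ == "__main__":
    # A fixed sequence stands in for stochastic sampling from an MLLM policy.
    samples = iter(["walk forward", "open fridge", "walk forward", "pick up mug"])
    best, scored = select_action(
        "the mug is inside the closed fridge",
        lambda obs: next(samples),
        toy_verifier,
    )
    print(best)    # open fridge
    print(scored)  # each distinct candidate with its verifier score
```

Because selection happens entirely outside the policy, the same wrapper could in principle sit in front of any action sampler; the quality of the result depends on how well the verifier's scores track real reliability.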
To train the verifier effectively, the authors developed an LLM-driven data synthesis strategy that automatically builds a diverse curriculum of failure cases, exposing the verifier to a rich distribution of potential errors. Without this synthetic training, using an off-the-shelf MLLM as a verifier yielded no improvement. Across challenging embodied reasoning benchmarks in Habitat and ALFRED environments, VeGAS achieved up to a 36% relative performance gain over strong CoT baselines, especially on long-horizon, multi-object tasks. The paper was accepted as a CVPR 2026 Finding.
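Since the headline figure is a relative gain, it helps to see how that differs from an absolute one. The snippet below is a small worked example; the two success rates are hypothetical numbers picked to produce a 36% relative gain and are not taken from the paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement as a fraction of the baseline score."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical success rates, for illustration only (not reported results).
baseline_success = 0.25   # CoT baseline
verified_success = 0.34   # same policy with verifier-guided selection

gain = relative_gain(baseline_success, verified_success)
absolute = verified_success - baseline_success
print(f"relative gain: {gain:.0%}, absolute gain: {absolute * 100:.0f} points")
```

Under these assumed numbers, a 36% relative gain corresponds to only 9 percentage points of absolute improvement, which is why the baseline level matters when reading such figures.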
- VeGAS samples an ensemble of candidate actions at inference time and uses a generative verifier to select the best, without modifying the base policy.
- An LLM-driven data synthesis strategy automatically creates a curriculum of failure cases to train the verifier. This training is essential: an off-the-shelf MLLM used as a verifier yielded no improvement.
- VeGAS achieves up to a 36% relative improvement over strong CoT baselines on Habitat and ALFRED benchmarks, with the largest gains on long-horizon, multi-object tasks.
Why It Matters
Makes embodied AI agents more reliable in complex, real-world scenarios by verifying actions before execution.