Robotics

RoboAbstention benchmark: VLMs fail to say 'no' in robot tasks

arXiv cs.RO May 21, 2026

⚡ Best model abstains on only 39% of ambiguous, impossible, or false-premise robot instructions.

Deep Dive

A new paper by researchers at Purdue University, titled 'The Yes-Man Syndrome,' highlights a critical flaw in vision-language models (VLMs) used as robotic planners: they rarely refuse instructions that are ambiguous, physically impossible, or based on false premises. The team created RoboAbstention, a scalable framework that generates instructions grounded in images from five robotics datasets. The pipeline includes structured visual grounding, deterministic constraint derivation, and controlled instruction generation via category-specific templates, producing a diverse dataset of 6,069 instructions with verifiable abstention conditions.
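
The three pipeline stages can be pictured with a small Python sketch. Everything in it is an illustrative assumption rather than the authors' code: the `Scene` structure, the `TEMPLATES` wording, the two categories shown, and the helper names are hypothetical. The point is only that once a scene is described in structured form, the abstention condition for each generated instruction follows deterministically and can be checked.

```python
from dataclasses import dataclass


@dataclass
class Scene:
    """Hypothetical structured grounding of one image: object name -> attributes."""
    objects: dict


# Hypothetical category-specific templates; the paper's real templates may differ.
TEMPLATES = {
    "false_premise": "Pick up the {name} from the table.",
    "ambiguous": "Hand me the {name}.",
}


def derive_constraints(scene, vocabulary):
    """Deterministically derive which objects are absent or duplicated."""
    absent = sorted(set(vocabulary) - set(scene.objects))
    duplicated = sorted(
        name for name, attrs in scene.objects.items() if attrs.get("count", 1) > 1
    )
    return {"false_premise": absent, "ambiguous": duplicated}


def generate_instructions(scene, vocabulary):
    """Fill each template and attach a verifiable reason to abstain."""
    reasons = {
        "false_premise": "{name} is not present in the scene",
        "ambiguous": "more than one {name} is present and none is specified",
    }
    instructions = []
    for category, names in derive_constraints(scene, vocabulary).items():
        for name in names:
            instructions.append({
                "category": category,
                "text": TEMPLATES[category].format(name=name),
                "should_abstain": True,
                "reason": reasons[category].format(name=name),
            })
    return instructions


scene = Scene(objects={"mug": {"count": 2}, "spoon": {"count": 1}})
items = generate_instructions(scene, vocabulary=["mug", "spoon", "knife"])
for item in items:
    print(item["category"], "|", item["text"], "|", item["reason"])
```

Because the reason to abstain is derived from the scene description rather than written by hand, each instruction carries its own ground truth, which is what makes a dataset of this kind scalable.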

Testing frontier VLMs revealed severe weaknesses. Gemini 2.5 Flash, the best-performing model, abstained on only 39.0% of instructions, and the embodied robotics planner Gemini Robotics ER 1.6 Preview abstained on just 16.5%. Even models with advanced reasoning capabilities struggled. Interventions such as defensive prompting and in-context learning improved performance dramatically: Gemini Robotics ER 1.6 Preview reached 93.6% abstention and GPT-5.4 Mini reached 88.6%. The dataset and framework are open source and available on GitHub, with the goal of helping build safer, more trustworthy embodied agents.
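
As a rough illustration of what is being measured, the sketch below computes an abstention rate and wraps an instruction in a defensive prompt. It rests on stated assumptions: the prefix wording, the `ABSTAIN` keyword convention, and the function names are hypothetical, and the paper's actual prompts and its method for judging whether a response counts as an abstention are not described here.

```python
# Hypothetical defensive prefix; the wording used in the paper is not shown here.
DEFENSIVE_PREFIX = (
    "Before planning, check the image. If the instruction is ambiguous, "
    "physically impossible, or based on a false premise, reply with "
    "'ABSTAIN:' followed by the reason instead of a plan."
)


def with_defensive_prompt(instruction):
    """Prepend the defensive prefix to a raw robot instruction."""
    return f"{DEFENSIVE_PREFIX}\n\nInstruction: {instruction}"


def is_abstention(response):
    """Assumed convention: a response abstains if it starts with 'ABSTAIN'."""
    return response.strip().upper().startswith("ABSTAIN")


def abstention_rate(responses):
    """Percentage of responses that abstain, over instructions that all warrant it."""
    if not responses:
        raise ValueError("need at least one response")
    return 100.0 * sum(is_abstention(r) for r in responses) / len(responses)


# Made-up responses for illustration only, not benchmark data.
responses = [
    "ABSTAIN: there is no knife in the scene.",
    "Moving the arm to the mug.",
    "abstain: two mugs are visible, please specify which one.",
    "Picking up the spoon.",
]
rate = abstention_rate(responses)
print(f"{rate:.1f}%")
```

Since every instruction in a benchmark like this one warrants refusal, a higher rate is better, and the reported figures (39.0%, 16.5%, 93.6%, 88.6%) can be read as values of this kind of ratio.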

Key Points

RoboAbstention benchmark contains 6,069 instructions grounded in images from five robotics datasets.
The best VLM (Gemini 2.5 Flash) abstains on only 39.0% of instructions; Gemini Robotics ER 1.6 Preview abstains on just 16.5%.
Defensive prompting and in-context learning raise abstention to 93.6% for Gemini Robotics ER 1.6 Preview and 88.6% for GPT-5.4 Mini.

Why It Matters

Embodied AI agents that blindly follow orders pose safety risks; teaching them to say 'no' is critical.
