Robotics

S2: New AI framework boosts robot subtask success from 54% to 79%

arXiv cs.RO June 03, 2026

⚡ Training robots to ignore distractions and follow refined instructions lifts mean subtask success by about 25 percentage points.

Deep Dive

Generalization remains a central bottleneck for vision-language-action (VLA) models, which often struggle with visual distractors, appearance shifts, and semantically similar tasks. A new paper by Yueh-Hua Wu, Tatsuya Matsushima, and Kei Ota introduces S2 (See Less, Specify More), a framework that tackles this by reframing the learning problem for the executor. It has two components. "Specify More" preserves the original high-level instruction while adding refined trajectory- and subtask-level language that disambiguates the current execution mode. "See Less" imposes an explicit visual evidence budget that trains the executor to act from task-sufficient evidence rather than unconstrained visual context. Neither component requires region or mask annotations.
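To make the two ideas concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the article does not describe how S2 builds its refined language or enforces its budget, so the `Guidance` fields, the prompt layout, the per-patch relevance scores, and the top-k selection rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Guidance:
    """Hypothetical container for the layered language the executor receives."""
    task: str        # original high-level instruction, kept unchanged
    trajectory: str  # assumed trajectory-level refinement
    subtask: str     # assumed subtask-level refinement


def build_instruction(g: Guidance) -> str:
    # "Specify More": keep the original instruction and append finer-grained text.
    return f"Task: {g.task}\nTrajectory: {g.trajectory}\nSubtask: {g.subtask}"


def apply_evidence_budget(scores: list[float], budget: float) -> list[int]:
    """"See Less" stand-in: keep only the highest-scoring image patches.

    `scores` are assumed per-patch relevance values from some upstream model;
    `budget` is the fraction of patches the executor is allowed to see.
    Returns the kept patch indices in ascending order.
    """
    if not 0.0 < budget <= 1.0:
        raise ValueError("budget must be in (0, 1]")
    keep = max(1, int(budget * len(scores)))  # round down, but keep at least one
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


guidance = Guidance(
    task="put the cup on the shelf",
    trajectory="approach the table from the left",
    subtask="grasp the cup handle",
)
prompt = build_instruction(guidance)
kept = apply_evidence_budget([0.1, 0.9, 0.3, 0.8, 0.05, 0.2, 0.7, 0.4], budget=0.25)
print(prompt)
print(kept)  # [1, 3]
```

With a budget of 0.25 over eight patches, only the two highest-scoring patches survive, which is the sense in which the executor "sees less" while the prompt "specifies more".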

In evaluations across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raised mean subtask success from 54.2% for the pi0.5 baseline to 79.0%, an absolute gain of 24.8 percentage points and a relative gain of over 45%. The framework remains compatible with off-the-shelf VLM planners through in-context learning. The results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than being forced to recover both from weak supervision. This approach could make robotic policies considerably more robust in real-world environments.
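The absolute and relative gains follow directly from the two reported success rates. The short check below uses only those two figures from the article; the function and variable names are illustrative.

```python
baseline = 54.2  # mean subtask success (%) reported for the pi0.5 baseline
with_s2 = 79.0   # mean subtask success (%) reported with S2


def gains(before: float, after: float) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute_points = after - before
    relative_percent = 100.0 * absolute_points / before
    return absolute_points, relative_percent


absolute_points, relative_percent = gains(baseline, with_s2)
print(f"absolute gain: {absolute_points:.1f} percentage points")  # 24.8
print(f"relative gain: {relative_percent:.1f}%")                  # 45.8%
```

The headline "25 points" is the absolute figure rounded up; "over 45%" is the relative figure measured against the baseline.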

Key Points

S2 raises mean VLA robot subtask success from 54.2% to 79.0% across 8 real-robot tasks.
Combines 'See Less' (a visual evidence budget) and 'Specify More' (refined trajectory- and subtask-level language) without region or mask annotations.
Compatible with off-the-shelf VLM planners via in-context learning, so the planner needs no retraining.

Why It Matters

Could make robot policies more robust to visual clutter and ambiguous instructions, a step toward practical autonomous systems.
