S2: New AI framework boosts robot task success from 54% to 79%
Training robots to ignore distractions and follow refined instructions lifts mean subtask success by about 25 percentage points.
Generalization remains a central bottleneck for vision-language-action (VLA) models, which often struggle with visual distractors, appearance shifts, and semantically similar tasks. A new paper by Yueh-Hua Wu, Tatsuya Matsushima, and Kei Ota introduces S2 (See Less, Specify More), a framework that tackles this by reframing the learning problem for the executor. It has two parts. "Specify More" preserves the original high-level instruction while adding refined trajectory- and subtask-level language that disambiguates the current execution mode. "See Less" imposes an explicit visual evidence budget that trains the executor to act from task-sufficient evidence rather than unconstrained visual context. The framework requires no region or mask annotations.
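To make the "Specify More" idea concrete, here is a minimal sketch of how an executor prompt might keep the original instruction and layer finer-grained language on top of it. The class name, field names, prompt layout, and example strings are all hypothetical illustrations; the paper's actual format is not described in this summary.

```python
from dataclasses import dataclass


@dataclass
class ExecutorInstruction:
    """Hypothetical container; not the paper's actual data format."""
    task: str        # original high-level instruction, kept unchanged
    trajectory: str  # refined trajectory-level description (assumed field)
    subtask: str     # refined description of the current subtask (assumed field)

    def to_prompt(self) -> str:
        # The high-level task always comes first, so the refinements
        # add detail without replacing the original instruction.
        return (
            f"Task: {self.task}\n"
            f"Trajectory: {self.trajectory}\n"
            f"Current subtask: {self.subtask}"
        )


# Invented example strings, for illustration only.
example = ExecutorInstruction(
    task="put the cup in the sink",
    trajectory="approach the table, grasp the cup, carry it to the sink",
    subtask="grasp the cup by its handle",
)
prompt = example.to_prompt()
print(prompt)
```

The point of the sketch is only that the refined language is added alongside the original instruction, so the executor sees both the overall goal and the current step.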
In evaluations across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raised mean subtask success from 54.2% to 79.0% over the pi0.5 baseline, an absolute gain of 24.8 percentage points and a relative gain of over 45%. The framework remains compatible with off-the-shelf VLM planners through in-context learning. The results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than being forced to recover both from weak supervision. This approach could significantly enhance the robustness of robotic policies in real-world environments.
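The headline numbers mix two kinds of gain, so a quick check of the arithmetic may help. The sketch below uses only the two reported success rates; the variable names are illustrative.

```python
baseline = 54.2  # mean subtask success (%) for the pi0.5 baseline, as reported
s2 = 79.0        # mean subtask success (%) with S2, as reported

# Absolute gain is measured in percentage points.
absolute_gain = s2 - baseline

# Relative gain is the absolute gain as a percentage of the baseline.
relative_gain = absolute_gain / baseline * 100

print(f"absolute gain: {absolute_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

The roughly 25-point figure is the absolute difference between the two rates, while the "over 45%" figure is that difference divided by the baseline.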
- S2 boosts mean VLA robot subtask success from 54.2% to 79.0% across eight real-robot tasks.
- Combines 'See Less' (visual evidence budget) and 'Specify More' (refined local language instructions) without region or mask annotations.
- Remains compatible with off-the-shelf VLM planners through in-context learning, so the planner itself is not retrained.
Why It Matters
Could make robot policies more robust to visual clutter and ambiguous instructions, a step toward practical autonomous systems.