Vision-Language-Action Model, Robustness, Multi-modal Learning, Robot Manipulation
New decoupled training method makes vision-language-action models 12-16% more reliable against real-world noise and corruption.
A team of researchers has introduced STRONG-VLA, a novel framework designed to address a critical weakness in today's most advanced robot brains: fragility. Vision-Language-Action (VLA) models, which allow robots to understand visual scenes and follow language commands, often fail dramatically when faced with real-world imperfections such as blurry camera feeds, occlusions, or ambiguous instructions. Current methods train for robustness and task performance simultaneously, which pits the two optimization goals against each other. STRONG-VLA breaks this deadlock with a simple two-stage, decoupled approach.
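The source does not enumerate the specific perturbations, but the three failure modes it names (visual noise, occlusion, ambiguous language) can be illustrated with a few representative corruptions. This is a minimal NumPy sketch; the function names and parameters are hypothetical, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_noise(img, sigma=0.1):
    """Additive sensor noise on a float image in [0, 1]."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)

def random_occlusion(img, frac=0.25):
    """Blank out a random square patch, mimicking an occluded view."""
    out = img.copy()
    h, w = img.shape[:2]
    ph, pw = int(h * frac), int(w * frac)
    y = rng.integers(0, h - ph + 1)
    x = rng.integers(0, w - pw + 1)
    out[y:y + ph, x:x + pw] = 0.0
    return out

def drop_words(instruction, p=0.2):
    """Linguistic noise: randomly drop words from a command."""
    words = instruction.split()
    kept = [w for w in words if rng.random() > p] or words[:1]
    return " ".join(kept)

frame = rng.random((224, 224, 3))       # stand-in camera frame
noisy = gaussian_noise(frame)
occluded = random_occlusion(frame)
cmd = drop_words("pick up the red mug and place it on the shelf")
```

A robustness curriculum would apply corruptions like these to training batches rather than relying only on clean demonstrations.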
In Stage I, the model is exposed to a carefully designed curriculum of 28 perturbation types, simulating everything from visual corruption to linguistic noise, at progressively increasing difficulty. This builds a foundation of general robustness. In Stage II, the model is then fine-tuned on clean, high-quality task data to recover precise execution fidelity without losing its newfound resilience. The results are significant: on the LIBERO benchmark, the method boosted OpenVLA's task success rate by up to 12.60% against known perturbations and 7.77% against unseen ones, with even larger gains for other architectures such as OpenVLA-OFT and pi0, demonstrating cross-model effectiveness.
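The decoupled schedule can be sketched as a minimal training loop: a perturbed Stage I with a severity ramp, followed by a clean Stage II. Everything here is a placeholder, assuming a linear difficulty curriculum; `train_step`, the perturbation names, and the epoch counts are illustrative, not the paper's implementation.

```python
import random

# Hypothetical stand-ins for the paper's 28 perturbation types.
PERTURBATIONS = ["gaussian_noise", "motion_blur", "occlusion", "word_drop"]

def severity_schedule(epoch, total_epochs, max_severity=5):
    """Linear curriculum: difficulty grows from 1 to max_severity."""
    return 1 + round((max_severity - 1) * epoch / max(total_epochs - 1, 1))

def train_step(model, batch, perturb=None, severity=0):
    # Placeholder for one optimization step; a real VLA trainer would
    # compute an action-prediction loss and back-propagate here.
    model["steps"] += 1
    if perturb:
        model["seen"].add((perturb, severity))
    return model

def decoupled_training(model, data, stage1_epochs=3, stage2_epochs=2):
    # Stage I: robustness pre-training under the perturbation curriculum.
    for epoch in range(stage1_epochs):
        sev = severity_schedule(epoch, stage1_epochs)
        for batch in data:
            p = random.choice(PERTURBATIONS)
            train_step(model, batch, perturb=p, severity=sev)
    # Stage II: fine-tune on clean data to restore execution fidelity.
    for _ in range(stage2_epochs):
        for batch in data:
            train_step(model, batch)
    return model

model = decoupled_training({"steps": 0, "seen": set()}, data=[0, 1, 2, 3])
```

The key design point is that the two objectives never share a loss: robustness is learned first, then task precision is recovered on clean data, avoiding the tug-of-war of joint training.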
The framework's practical value was confirmed through real-world testing on an AIRBOT robotic platform. By explicitly separating the learning of robustness from task alignment, STRONG-VLA provides a principled and effective path to creating AI agents that can operate reliably outside the sterile conditions of a lab. This work establishes a new benchmark for evaluating multimodal robustness and offers a scalable training blueprint for the next generation of dependable embodied AI.
- Two-stage decoupled training: First builds robustness against 28 perturbation types, then refines task performance on clean data.
- Significant performance gains: Achieved up to 16.49% higher task success rates on the LIBERO benchmark across multiple VLA model architectures.
- Real-world validation: Framework tested and proven effective on a physical AIRBOT robotic platform, moving beyond simulation.
Why It Matters
Enables more reliable robots and AI agents that can function in messy, unpredictable real-world environments, crucial for practical deployment.