Robotics

StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

Open-source framework integrates Qwen-VL and Cosmos backbones with 5 major robotics benchmarks in a modular architecture.

Deep Dive

The StarVLA Community has launched StarVLA, a comprehensive open-source framework designed to unify the fragmented field of Vision-Language-Action (VLA) research. VLA models are crucial for developing generalist embodied AI agents that can perceive, understand language, and act in the world, yet research is currently hampered by incompatible architectures, codebases, and evaluation protocols. StarVLA tackles this with a modular, 'Lego-like' backbone-action-head architecture in which components such as the multimodal backbone (e.g., Qwen-VL or Cosmos) and the action-decoding head can be swapped independently, enabling principled comparison and rapid prototyping of new VLA methods.
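
To illustrate the Lego-like idea, here is a minimal sketch of how a policy could be assembled from an interchangeable backbone and action head. The class names (MultimodalBackbone, ActionHead, VLAPolicy) and the commented-out backbones are hypothetical stand-ins for illustration, not StarVLA's actual API.

```python
# Hypothetical sketch of "Lego-like" composition; names are illustrative only.
from dataclasses import dataclass

import torch
import torch.nn as nn


class MultimodalBackbone(nn.Module):
    """Interface any vision-language or world-model backbone would implement."""

    def encode(self, images: torch.Tensor, instruction: str) -> torch.Tensor:
        raise NotImplementedError


class ActionHead(nn.Module):
    """Interface any action decoder (e.g., autoregressive or diffusion) would implement."""

    def decode(self, features: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


@dataclass
class VLAPolicy:
    """A VLA policy assembled from independently swappable parts."""

    backbone: MultimodalBackbone
    action_head: ActionHead

    def act(self, images: torch.Tensor, instruction: str) -> torch.Tensor:
        # The pipeline stays identical no matter which parts are plugged in.
        features = self.backbone.encode(images, instruction)
        return self.action_head.decode(features)


# Swapping a component changes one line, not the training or evaluation code:
# policy = VLAPolicy(backbone=QwenVLBackbone(), action_head=DiffusionActionHead())
# policy = VLAPolicy(backbone=CosmosBackbone(), action_head=FlowMatchingHead())
```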

Beyond architecture, StarVLA delivers reusable training strategies like cross-embodiment learning and integrates five major robotics benchmarks—LIBERO, SimplerEnv, RoboTwin 2.0, RoboCasa-GR1, and BEHAVIOR-1K—through a single evaluation interface that works for both simulation and real robots. Remarkably, the project includes simple, reproducible training recipes that, with minimal data engineering, already achieve performance matching or exceeding prior state-of-the-art methods on multiple benchmarks. As one of the most comprehensive open-source VLA frameworks available, StarVLA significantly lowers the barrier to entry for researchers, enabling faster reproduction of existing work and accelerating innovation in embodied AI.
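
To make the single-evaluation-interface idea concrete, the sketch below shows a benchmark registry and a rollout loop that stays the same regardless of which suite is selected. The registry, the adapter protocol, and the benchmark keys are assumptions for illustration, not StarVLA's actual code.

```python
# Illustrative-only sketch of a unified evaluation interface; not StarVLA's API.
from typing import Callable, Dict, Protocol, Tuple


class BenchmarkEnv(Protocol):
    """Minimal contract every benchmark adapter exposes to the evaluator."""

    def reset(self) -> dict: ...
    def step(self, action) -> Tuple[dict, float, bool, dict]: ...


# A registry maps benchmark names to factories that build adapted environments.
BENCHMARKS: Dict[str, Callable[[], BenchmarkEnv]] = {}


def register(name: str):
    """Decorator that records a benchmark factory under a short name."""
    def wrap(factory: Callable[[], BenchmarkEnv]):
        BENCHMARKS[name] = factory
        return factory
    return wrap


def evaluate(policy, benchmark: str, episodes: int = 10) -> float:
    """Run the same rollout loop no matter which benchmark is selected."""
    env = BENCHMARKS[benchmark]()
    successes = 0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs["images"], obs["instruction"])
            obs, reward, done, info = env.step(action)
        successes += int(info.get("success", False))
    return successes / episodes


# Example usage, assuming adapters for those suites have been registered:
# evaluate(policy, "libero"); evaluate(policy, "simplerenv")
```

The design point is that each benchmark, simulated or real, only needs a thin adapter conforming to the same contract, so the same policy and evaluator run unmodified across suites.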

Key Points
  • Modular architecture supports swapping between VLM backbones (Qwen-VL) and world-model backbones (Cosmos) with independent action heads.
  • Unified evaluation across 5 major benchmarks (LIBERO, BEHAVIOR-1K, etc.) for both simulation and real-robot deployment.
  • Simple training recipes match or exceed prior state-of-the-art results on multiple benchmarks, accelerating reproducible research and new-method prototyping.

Why It Matters

It standardizes a fragmented research field, enabling faster development of robots and AI agents that can see, understand, and act.