Research & Papers

X Square Robot's Wall-OSS-0.5: 4B VLA beats pi0.5 by 17.5pp on real robots

Evaluated zero-shot on 17 real-robot tasks, including rope tightening at 82 task progress.

Deep Dive

X Square Robot has open-sourced Wall-OSS-0.5, a 4-billion-parameter Vision-Language-Action (VLA) model that reports state-of-the-art results on real-robot manipulation. Built on a 3B VLM backbone with a Mixture-of-Transformers architecture and specialized action experts, the model was evaluated on physical robots both zero-shot and after fine-tuning. In zero-shot testing across a 17-task suite, it scored above 80 task progress on 4 tasks, including the challenging deformable-object task "Rope Tightening" (82 task progress). After fine-tuning on a 15-task suite, Wall-OSS-0.5 reached an average task progress of 60.5, outperforming the prior state-of-the-art pi0.5 by 17.5 percentage points overall and by 26pp on the 10-task manipulation subset. The model also showed a 21.8pp improvement in embodied grounding while keeping general vision-language ability stable.

The team attributes these gains to several technical choices. A "gradient bridge" mechanism ensures that the cross-entropy loss on discrete action tokens dominates the gradient flowing into the VLM backbone, while the flow-matching loss's share of that gradient collapses to roughly 5% after a few thousand steps. For continuous actions, flow matching is supervised in recovered action space rather than in velocity space. The Vision-Aligned Residual Vector Quantization (RVQ) tokenizer is designed to ground action tokens semantically rather than treat them as mere numerical compression. The team also introduces DMuon, a distributed version of the Muon optimizer, for which it claims substantial overhead reductions. The full code and pretrained weights are available on GitHub and Hugging Face alongside the paper, and the community is invited to reproduce the results on real hardware.
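To make the distinction between velocity-space and recovered-action-space supervision concrete, here is a minimal sketch in plain Python. It is not taken from the Wall-OSS-0.5 code. It assumes the common straight-line flow-matching path x_t = (1 - t) * noise + t * action, with t = 0 at noise and t = 1 at the clean action; the paper's own path, time convention, and loss weighting may differ. All function names are hypothetical, and the "network output" is a random stand-in.

```python
import random


def interpolate(noise, action, t):
    """Point on the assumed straight path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def velocity_loss(v_pred, noise, action):
    """MSE between the predicted velocity and the target velocity (action - noise)."""
    target = [a - n for n, a in zip(noise, action)]
    return sum((p - v) ** 2 for p, v in zip(v_pred, target)) / len(target)


def recover_action(x_t, v_pred, t):
    """Follow the predicted velocity from x_t to the t=1 endpoint of the path."""
    return [x + (1.0 - t) * v for x, v in zip(x_t, v_pred)]


def action_space_loss(v_pred, noise, action, t):
    """MSE between the recovered action and the ground-truth action."""
    x_t = interpolate(noise, action, t)
    a_hat = recover_action(x_t, v_pred, t)
    return sum((h - a) ** 2 for h, a in zip(a_hat, action)) / len(action)


rng = random.Random(0)
action = [0.2, -0.5, 0.9]                      # hypothetical 3-DoF action
noise = [rng.gauss(0.0, 1.0) for _ in action]  # sample at t=0
v_pred = [rng.gauss(0.0, 1.0) for _ in action]  # stand-in for a network output
t = 0.25

loss_v = velocity_loss(v_pred, noise, action)
loss_a = action_space_loss(v_pred, noise, action, t)
print("velocity-space loss:", loss_v)
print("action-space loss:  ", loss_a)
```

Under this assumed path, the recovered-action error equals (1 - t) times the velocity error, so the two losses differ only by a factor of (1 - t)^2. In other words, supervising in action space here acts as a time-dependent reweighting that emphasizes noisier samples near t = 0. That is a property of this particular sketch, not a statement about how Wall-OSS-0.5 weights its loss.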

Key Points
  • Wall-OSS-0.5 is a 4B VLA model from X Square Robot built on a 3B VLM backbone with Mixture-of-Transformers action experts.
  • Zero-shot real-robot evaluation on 17 tasks: 4 tasks above 80 task progress, including deformable rope tightening at 82.
  • After fine-tuning on 15 tasks, it achieved 60.5 average task progress, beating pi0.5 by 17.5pp overall and by 26pp on the 10-task manipulation subset.

Why It Matters

An open-source VLA with strong zero-shot performance on real robots could accelerate practical robotics deployment and make published results easier to reproduce.