Robotics

EvoScene-VLA boosts robot control by evolving scene beliefs across actions

New model raises average success by about 2 percentage points on 31 RoboTwin tasks by tracking how robot actions change the scene

Deep Dive

Chunked vision-language-action (VLA) policies typically predict multi-step robot controls from the current visual observation alone, ignoring how the robot's own actions alter the scene through contact, occlusion, and object motion. EvoScene-VLA, developed by researchers including Chushan Zhang and Hongdong Li, addresses this with a recurrent scene prefix that carries a geometry-aware scene state across control chunks. At each VLM call, the model combines current visual information with an action-updated prior from the previous chunk. The action decoder then outputs both the next action chunk and a compact scene update, which becomes the prior for the next call. During training, a Scene Predictor supplies future scene-token targets and a Geometric Anchor aligns scene slots with frozen depth and 3D teachers. Both modules are discarded at deployment, so they add no cost at inference time.
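The recurrence described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ToyChunkedPolicy`, `ChunkOutput`, and `run_episode` are hypothetical names, the random linear maps merely stand in for the VLM and action decoder, and the training-only Scene Predictor and Geometric Anchor are omitted because they are not used at deployment.

```python
import math
import random
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class ChunkOutput:
    actions: List[Vector]  # chunk_len rows, each with action_dim values
    scene_update: Vector   # compact scene state handed to the next call


class ToyChunkedPolicy:
    """Hypothetical stand-in for the VLM plus action decoder (random linear maps)."""

    def __init__(self, obs_dim: int, scene_dim: int, chunk_len: int,
                 action_dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        in_dim = obs_dim + scene_dim
        self.scene_dim = scene_dim
        self.chunk_len = chunk_len
        self.action_dim = action_dim

        def weights(n_out: int) -> List[Vector]:
            return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)]
                    for _ in range(n_out)]

        self.w_action = weights(chunk_len * action_dim)
        self.w_scene = weights(scene_dim)

    def initial_prior(self) -> Vector:
        # No scene belief exists before the first call, so start from zeros.
        return [0.0] * self.scene_dim

    @staticmethod
    def _project(w: List[Vector], x: Vector) -> Vector:
        return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

    def step(self, obs: Vector, scene_prior: Vector) -> ChunkOutput:
        # Combine the current observation with the prior from the last chunk.
        x = list(obs) + list(scene_prior)
        flat = self._project(self.w_action, x)
        actions = [flat[i * self.action_dim:(i + 1) * self.action_dim]
                   for i in range(self.chunk_len)]
        # Emit the action chunk and the scene update together.
        return ChunkOutput(actions, self._project(self.w_scene, x))


def run_episode(policy: ToyChunkedPolicy, observations: List[Vector]):
    prior = policy.initial_prior()
    chunks = []
    for obs in observations:
        out = policy.step(obs, prior)
        chunks.append(out.actions)
        prior = out.scene_update  # becomes the prior for the next call
    return chunks, prior
```

Feeding the same observation twice yields different action chunks, because the second call also sees the scene update produced by the first. That carried state is what a policy conditioned only on the current frame lacks.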

On 31 RoboTwin tasks, EvoScene-VLA raises average success from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation. Real-robot experiments on the Galaxea R1-Lite platform also show consistent improvements over all baselines. The key innovation is maintaining a persistent action-updated scene state across control calls, allowing the robot to anticipate how its own movements change the environment before the next visual update arrives. This work bridges a gap between spatial VLAs (which improve current-frame geometry) and temporal VLAs (which aggregate past frames) by explicitly modeling scene evolution due to robot actions.
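To put the reported averages in perspective, the short calculation below converts them into percentage-point gains and into the share of baseline failures removed. The failure-reduction figure is derived here from the four reported averages and is not a number the source states; the helper names are illustrative.

```python
def point_gain(before: float, after: float) -> float:
    """Absolute gain in percentage points."""
    return round(after - before, 1)


def failure_reduction(before: float, after: float) -> float:
    """Share of baseline failures removed, in percent (derived, not reported)."""
    return round(100.0 * (after - before) / (100.0 - before), 1)


# Average success rates reported for the 31 RoboTwin tasks.
results = {"fixed": (87.2, 89.1), "randomized": (86.1, 88.5)}

for setting, (before, after) in results.items():
    print(setting, point_gain(before, after), failure_reduction(before, after))
# fixed 1.9 14.8
# randomized 2.4 17.3
```

Seen this way, a gain of roughly 2 points near 90% success corresponds to removing about one in seven to one in six of the baseline's failures.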

Key Points
  • EvoScene-VLA uses a recurrent scene prefix to carry geometry-aware scene state across action chunks.
  • On 31 RoboTwin tasks, average success rose from 87.2% to 89.1% in fixed evaluation and from 86.1% to 88.5% in randomized evaluation.
  • In real-robot tests on the Galaxea R1-Lite platform, EvoScene-VLA improved on all baselines.

Why It Matters

EvoScene-VLA lets robots anticipate how their own actions change the scene, improving precision in sequential tasks.