FUTURE-VLA: Forecasting Unified Trajectories Under Real-time Execution
New model processes long video streams in real time, achieving 99.2% success on robot tasks.
Researchers from Tsinghua University and collaborators developed FUTURE-VLA, a unified vision-language-action model for robots. It uses temporal compression and latent-space autoregression to process extensive multi-view observation histories while keeping inference latency constant. The system achieved a 99.2% success rate on the LIBERO benchmarks and extended its spatiotemporal window to 16x that of baselines, enabling real-time future forecasting and human-in-the-loop validation through interactive execution gating.
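The summary does not spell out the architecture, so the following is only a minimal PyTorch-style sketch of what "temporal compression plus latent-space autoregression" could look like in principle: a cross-attention module squeezes a variable-length multi-view history into a fixed number of latent tokens, and a small decoder rolls out future latents autoregressively against that fixed-size context, which is why per-step latency stays roughly constant. All module names (`TemporalCompressor`, `LatentForecaster`), dimensions, horizons, and the 7-DoF action head are illustrative assumptions, not FUTURE-VLA's actual implementation.

```python
import torch
import torch.nn as nn

class TemporalCompressor(nn.Module):
    """Compresses a growing multi-view observation history into a
    fixed number of latent tokens so per-step compute stays constant."""
    def __init__(self, feat_dim=256, num_latents=32):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, feat_dim))
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, history):
        # history: (batch, T * num_views, feat_dim) -- variable length
        batch = history.shape[0]
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        compressed, _ = self.cross_attn(queries, history, history)
        return compressed  # (batch, num_latents, feat_dim) -- fixed size


class LatentForecaster(nn.Module):
    """Autoregressively rolls out future latents; each step attends only to
    the fixed-size compressed context plus the few latents produced so far."""
    def __init__(self, feat_dim=256, horizon=8):
        super().__init__()
        self.horizon = horizon
        layer = nn.TransformerDecoderLayer(feat_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.start = nn.Parameter(torch.randn(1, 1, feat_dim))
        self.action_head = nn.Linear(feat_dim, 7)  # assumed 7-DoF end-effector action

    def forward(self, context):
        batch = context.shape[0]
        seq = self.start.expand(batch, -1, -1)
        for _ in range(self.horizon):
            out = self.decoder(seq, context)           # attend to compressed history
            seq = torch.cat([seq, out[:, -1:, :]], 1)  # append next predicted latent
        future = seq[:, 1:, :]                         # predicted future latents
        return self.action_head(future)                # forecast action chunk


if __name__ == "__main__":
    # Fake multi-view history: 64 timesteps x 3 camera views of 256-d features.
    history = torch.randn(1, 64 * 3, 256)
    context = TemporalCompressor()(history)
    actions = LatentForecaster()(context)
    print(actions.shape)  # torch.Size([1, 8, 7]) -- an action forecast that can be previewed
```

Because the history is always squeezed to the same number of latent tokens before the rollout, lengthening the observation window changes the compression input but not the per-step decoding cost, which is the intuition behind the constant-latency claim.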
Why It Matters
Enables safer, more predictable autonomous robots that can preview actions for human approval before execution.
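How the execution gate works is not detailed in the summary; the sketch below is a hypothetical illustration of the preview-then-approve loop, where the forecast action chunk is shown to an operator and only dispatched on approval. The `robot`, `visualize`, and `ask_operator` interfaces are placeholder assumptions, not part of FUTURE-VLA's published API.

```python
def gated_execute(robot, forecast_actions, visualize, ask_operator):
    """Hypothetical human-in-the-loop gate: show the forecast rollout to an
    operator and dispatch the action chunk only if it is approved."""
    visualize(forecast_actions)             # render the predicted trajectory
    if ask_operator("Execute this plan?"):  # blocking approval prompt
        for action in forecast_actions:
            robot.step(action)
        return True
    return False                            # rejected: caller can replan instead
```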