Robotics

New study benchmarks four methods for handling inference delay in VLA robot control

A2C2 holds above a 90% solve rate with delays of up to 8 control steps on Kinetix

Deep Dive

Vision-Language-Action (VLA) models promise generalist robot control, but their inference latency creates observation staleness when actions are executed asynchronously. Four methods have emerged to combat this: IT-RTC (inference-time inpainting), TT-RTC (training-time delay simulation), VLASH (future-state-aware conditioning), and A2C2 (lightweight residual correction). Until now, they were evaluated independently with different codebases and protocols.
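To make observation staleness concrete, here is a minimal Python sketch of how old the observation behind each executed action is under asynchronous chunked execution. The schedule is an assumption for illustration, not the study's protocol: a new chunk is requested every `exec_horizon` steps, becomes usable `delay` steps later, and the robot always follows the newest chunk that has arrived. The function names are hypothetical.

```python
def observation_age(t: int, delay: int, exec_horizon: int) -> int:
    """Age, in control steps, of the observation behind the action run at step t.

    Assumed toy schedule: chunk k is requested at step k * exec_horizon and
    becomes usable `delay` steps later; the newest arrived chunk is followed.
    """
    if delay < 0 or exec_horizon < 1:
        raise ValueError("need delay >= 0 and exec_horizon >= 1")
    if t < delay:
        raise ValueError("no chunk has arrived before step `delay`")
    newest_chunk = (t - delay) // exec_horizon
    return t - newest_chunk * exec_horizon


def age_profile(delay: int, exec_horizon: int, steps: int) -> list[int]:
    """Observation ages for `steps` consecutive steps, starting once the first chunk arrives."""
    return [observation_age(t, delay, exec_horizon) for t in range(delay, delay + steps)]


def min_chunk_length(delay: int, exec_horizon: int) -> int:
    """Shortest chunk that still covers every executed step under this schedule."""
    return delay + exec_horizon


if __name__ == "__main__":
    print(age_profile(0, 4, 8))  # [0, 1, 2, 3, 0, 1, 2, 3]
    print(age_profile(2, 4, 8))  # [2, 3, 4, 5, 2, 3, 4, 5]
```

Under this schedule, no executed action is ever based on an observation fresher than `delay` steps, which is the gap the four methods try to close in different ways.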

Ayoub Agouzoul's study is the first controlled comparison. It uses unified codebases on two benchmarks, the Kinetix suite with MLPMixer policies and the LIBERO manipulation benchmark with SmolVLA, and sweeps inference delays (d) up to 20 control steps. A2C2's per-step residual correction proved most effective, holding above a 90% solve rate up to d=8 on Kinetix and leading on LIBERO from d=4 onward. TT-RTC emerged as the most robust training-based method: stable across delay choices, generalizable, and free of inference overhead. IT-RTC was competitive at low delays but degraded sharply with long action chunks (H=30) and high delays. VLASH showed a clear trade-off governed by the delay range used for fine-tuning. These results give practitioners clear guidance for improving real-time VLA deployment.
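The idea behind per-step residual correction can be sketched in a few lines of Python. This is a toy under stated assumptions, not A2C2's implementation: the correction head is assumed to see the latest observation, the base action from the stale chunk, and the step index within the chunk, and the hand-written `toy_head` stands in for what would be a small learned network. All names are hypothetical.

```python
from typing import Callable, Sequence

# Hypothetical head signature: (fresh_observation, base_action, step_in_chunk) -> residual
CorrectionHead = Callable[[float, float, int], float]


def apply_residual_correction(
    base_chunk: Sequence[float],
    fresh_observations: Sequence[float],
    head: CorrectionHead,
) -> list[float]:
    """Add a per-step residual to each base action, using the observation available at that step."""
    if len(base_chunk) != len(fresh_observations):
        raise ValueError("need one fresh observation per base action")
    return [
        base + head(obs, base, step)
        for step, (base, obs) in enumerate(zip(base_chunk, fresh_observations))
    ]


def toy_head(obs: float, base_action: float, step: int) -> float:
    """Hand-coded stand-in for a learned head (assumption for illustration only).

    Moves the action halfway toward -obs, i.e. toward driving a 1-D state to zero.
    The step index is unused here; a learned head could condition on it.
    """
    return 0.5 * (-obs - base_action)


if __name__ == "__main__":
    stale_chunk = [1.0, 1.0]   # base actions planned from an old observation
    fresh_obs = [0.0, -2.0]    # what the robot actually observes at each step
    print(apply_residual_correction(stale_chunk, fresh_obs, toy_head))  # [0.5, 1.5]
```

In a real control loop the fresh observations would come from the environment one step at a time rather than from a list; the list keeps the sketch deterministic. Because the base policy is not re-run, the only added cost per step is one call to the small head.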

Key Points
  • A2C2's per-step residual correction holds above a 90% solve rate up to d=8 on Kinetix and leads on LIBERO from d=4 onward
  • TT-RTC is robust across delay distributions, generalizes beyond training, and adds zero inference overhead
  • IT-RTC is competitive at low delays but degrades sharply with long action chunks (H=30) and high delays

Why It Matters

Real-time robot control gets a boost: A2C2 and TT-RTC make VLA-based autonomy more reliable when inference lags behind execution.