VLA-ATTC: Adaptive Test-Time Compute for VLA Models with Relative Action Critic Model
New framework lets vision-language-action models pause and deliberate before acting
Vision-Language-Action (VLA) models have shown impressive abilities in robotic manipulation, but they rely on fast, instinctive decision-making that can fail in complex or ambiguous situations. A new paper by Wenhao Li, Xiu Su, and colleagues introduces VLA-ATTC, which gives these models adaptive test-time compute (TTC). The framework features an uncertainty-based "cognitive clutch" that dynamically switches from reflexive execution to a deliberate reasoning phase when needed. During that phase, a novel Relative Action Critic (RAC) identifies the optimal action by comparing candidate actions pairwise rather than predicting unstable absolute values, a relative approach that simplifies learning and improves robustness.
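The announcement doesn't spell out the implementation, but the control flow is easy to picture: an uncertainty gate on the fast path, and a pairwise tournament on the slow path. Below is a minimal Python sketch of that flow; `action_distribution`, `sample`, and `rac_prefer` are hypothetical names, and entropy thresholding is just one plausible choice of uncertainty signal, not necessarily the paper's.

```python
import numpy as np

def entropy(probs: np.ndarray) -> float:
    """Shannon entropy of a categorical action distribution (uncertainty proxy)."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

def rac_select(candidates, prefer):
    """Single-elimination pass: keep the running winner of pairwise comparisons.

    `prefer(a, b)` stands in for the Relative Action Critic and returns True
    when action `a` beats action `b`; n-1 comparisons pick a winner.
    """
    best = candidates[0]
    for cand in candidates[1:]:
        if prefer(cand, best):
            best = cand
    return best

def act(policy, obs, entropy_threshold=1.5, n_candidates=8):
    """Cognitive clutch: reflexive fast path when confident, deliberation otherwise."""
    probs = policy.action_distribution(obs)           # hypothetical policy API
    if entropy(probs) < entropy_threshold:
        return policy.sample(obs)                     # fast, instinctive execution
    candidates = [policy.sample(obs) for _ in range(n_candidates)]
    return rac_select(candidates, policy.rac_prefer)  # deliberate reasoning phase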
On the LIBERO-LONG benchmark, VLA-ATTC reduces the failure rate of the current state-of-the-art model PI0.5 by over 50%. The system also includes an efficient sampling strategy to manage computational cost and an automated data pipeline that curates preference pairs without manual annotation. The authors plan to open-source all code and weights, making the framework accessible to robotics researchers and developers who want to add deliberative reasoning to their VLA models without expensive human labeling.
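The pipeline's details aren't given here either; a minimal sketch, assuming each rollout carries a task-success flag so outcomes can stand in for human preference labels, might pair actions from successful and failed trajectories like this:

```python
def curate_preference_pairs(rollouts):
    """Pair actions from successful vs. failed rollouts into (preferred, rejected).

    Assumes each rollout is a dict with an "actions" list and a boolean
    "success" flag, so task outcomes replace human annotation. Matching by
    timestep is a crude stand-in for matching by state; this is a guess at
    the paper's automated pipeline, not its actual recipe.
    """
    successes = [r for r in rollouts if r["success"]]
    failures = [r for r in rollouts if not r["success"]]
    pairs = []
    for good in successes:
        for bad in failures:
            horizon = min(len(good["actions"]), len(bad["actions"]))
            for t in range(horizon):
                pairs.append((good["actions"][t], bad["actions"][t]))
    return pairs
```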
- VLA-ATTC introduces an uncertainty-based "cognitive clutch" that switches between fast execution and deliberate reasoning
- A novel Relative Action Critic (RAC) selects the best action via pairwise comparisons, eliminating unstable absolute-value estimation (see the sketch after this list)
- Reduces failure rate of state-of-the-art PI0.5 model by over 50% on LIBERO-LONG benchmark; all code and weights will be open-sourced
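For a concrete picture of how a relative critic differs from an absolute one, here is an assumed PyTorch sketch: the architecture and loss are illustrative guesses, not the paper's recipe, but they show how pairwise preference turns value estimation into binary classification.

```python
import torch
import torch.nn as nn

class RelativeActionCritic(nn.Module):
    """Pairwise critic: scores whether action `a` beats action `b` for an observation.

    The architecture is an assumption; the paper only states that the critic
    compares candidate actions pairwise instead of regressing unstable
    absolute values.
    """
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 2 * act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act_a, act_b):
        # One logit: positive means `act_a` is preferred over `act_b`.
        return self.net(torch.cat([obs, act_a, act_b], dim=-1)).squeeze(-1)

# Training on curated (preferred, rejected) pairs reduces to binary classification:
#   logits = critic(obs, preferred, rejected)
#   loss = nn.BCEWithLogitsLoss()(logits, torch.ones_like(logits))
```

Because such a critic only ever emits a relative judgment, it never has to calibrate a global value scale, which is the instability the paper attributes to absolute-value estimation.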
Why It Matters
Gives robots the ability to think before acting, slashing failures in complex tasks without human annotation.