ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models
A new method prunes up to 85% of visual tokens, cutting inference FLOPs by 61% while retaining 94% of the original model's accuracy.
A research team from Tsinghua University and the Shanghai AI Lab has introduced ETA-VLA, a framework designed to tackle the heavy computational cost of Vision-Language-Action (VLA) models in autonomous driving. These models, which interpret multi-camera feeds and text prompts to make driving decisions, pay an attention cost that scales quadratically with token count, a problem that compounds as historical video frames accumulate. ETA-VLA's core innovation is the Intra-LLM Sparse Aggregator (ILSA), which mimics human attention by dynamically identifying and pruning up to 85% of redundant visual tokens from past frames, guided by the current textual query and temporal scene consistency.
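The paper's exact scoring rule isn't spelled out in this summary, but the idea of text-guided, temporally aware pruning can be illustrated with a minimal PyTorch sketch. Everything below (the function name, the similarity-based scores, the 15% keep ratio) is a hypothetical reconstruction, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def prune_history_tokens(hist_tokens, text_query, curr_tokens, keep_ratio=0.15):
    """Keep only the most task-relevant, least redundant history tokens.

    hist_tokens: (N, d) visual tokens from past frames
    text_query:  (T, d) embeddings of the driving instruction
    curr_tokens: (M, d) visual tokens from the current frame
    """
    hist = F.normalize(hist_tokens, dim=-1)

    # Text relevance: how strongly each past token matches any text token.
    text_rel = (hist @ F.normalize(text_query, dim=-1).T).max(dim=-1).values

    # Temporal redundancy: tokens already well represented in the current
    # frame carry little new information, so reward novelty instead.
    overlap = (hist @ F.normalize(curr_tokens, dim=-1).T).max(dim=-1).values
    novelty = 1.0 - overlap

    score = text_rel + novelty                       # combined keep-score
    k = max(1, int(keep_ratio * hist_tokens.shape[0]))
    keep = score.topk(k).indices                     # e.g. keep top 15%
    return hist_tokens[keep], keep

# Toy usage: 4 past frames of 256 tokens each, pruned to ~15%.
kept, idx = prune_history_tokens(
    torch.randn(1024, 768), torch.randn(12, 768), torch.randn(256, 768)
)
```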
Extensive testing on the NAVSIM v2 autonomous driving benchmark shows the method's efficiency. ETA-VLA matches state-of-the-art driving performance while cutting overall computational FLOPs by roughly 32%. During inference, the phase where a driving decision must actually be made, it reduces FLOPs by 61%. Despite this drastic reduction in processed data, the system retains 94% of the original model's accuracy, evidence that not all visual tokens contribute equally to situational reasoning.
The breakthrough lies in its text-guided scoring and diversity-preserving sparsification strategy. Instead of weighting every token from every past camera frame equally, ILSA learns to focus compute on the sparse subset of tokens most relevant to the driving task at hand, such as a changing traffic light or an approaching pedestrian. This moves AI driving systems closer to real-time feasibility by directly addressing one of their biggest bottlenecks: the unsustainable compute required for temporal reasoning over long video sequences.
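The summary doesn't specify how diversity is preserved; one standard way to avoid keeping many near-duplicate high-scoring tokens is a greedy, maximal-marginal-relevance style selection, sketched here under that assumption (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def diverse_topk(tokens, scores, k, lam=0.5):
    """Greedily pick k tokens, trading off relevance against redundancy.

    tokens: (N, d) token embeddings
    scores: (N,)   relevance scores (e.g. from text-guided scoring)
    lam:    balance between relevance (1.0) and diversity (0.0)
    """
    k = min(k, tokens.shape[0])
    feats = F.normalize(tokens, dim=-1)
    selected = [scores.argmax().item()]              # seed with best token
    for _ in range(k - 1):
        # Redundancy = similarity to the closest already-selected token.
        redundancy = (feats @ feats[selected].T).max(dim=-1).values
        mmr = lam * scores - (1.0 - lam) * redundancy
        mmr[selected] = float("-inf")                # never re-pick
        selected.append(mmr.argmax().item())
    return torch.tensor(selected)
```

Plain top-k keeps whatever scores highest, which in a busy scene can mean dozens of tokens from the same object; the redundancy penalty spreads the token budget across distinct scene elements.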
- Prunes up to 85% of visual tokens with the Intra-LLM Sparse Aggregator (ILSA), guided by the text query and temporal scene consistency.
- Cuts inference FLOPs by 61% while retaining 94% of original accuracy on the NAVSIM v2 benchmark.
- Reduces overall computational cost by ~32%, making high-fidelity VLA models for self-driving cars more efficient.
Why It Matters
Dramatically lowers the compute barrier for sophisticated AI driving systems, enabling more complex reasoning in real-time applications.