Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
Data fidelity vs. cost trade-off limits large-scale robot learning, says TMLR paper.
A comprehensive survey by Ziyao Wang and nine co-authors, accepted to Transactions on Machine Learning Research (TMLR), shifts the focus in Vision-Language-Action (VLA) robotics from model architectures to data infrastructure. The paper systematically examines three pillars, namely datasets, benchmarks, and data engines, arguing that progress in embodied AI will increasingly depend on co-designing high-fidelity data pipelines and rigorous evaluation protocols. The authors categorize real-world and synthetic datasets by embodiment diversity, modality composition, and action space formulation, highlighting a persistent fidelity-cost trade-off that fundamentally constrains large-scale collection. They also analyze benchmark complexity and environment structure, identifying structural gaps in how current protocols evaluate compositional generalization and long-horizon reasoning.
On the data engine front, the survey covers simulation-based, video-reconstruction, and automated task-generation paradigms, noting their shared limitations in physical grounding and sim-to-real transfer. Synthesizing these findings, the paper distills four open challenges: representation alignment across modalities, multimodal supervision for richer learning signals, reasoning assessment for complex tasks, and scalable data generation. The authors call for treating data infrastructure as a first-class research problem rather than a background concern, a stance that could reshape priorities in robotics AI research.
- Survey categorizes real-world and synthetic VLA datasets by embodiment diversity, modality, and action space, revealing a fidelity-cost trade-off.
- Identifies structural gaps in current benchmarks for compositional generalization and long-horizon reasoning evaluation.
- Distills four open challenges: representation alignment, multimodal supervision, reasoning assessment, and scalable data generation.
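The categorization axes summarized above can be sketched as a minimal data structure. This is an illustrative assumption about how such an index might look; the class names, enum values, and the example entry below are hypothetical and are not taken from the survey itself:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical enumerations mirroring the survey's axes
# (modality composition and action space formulation).
class Modality(Enum):
    RGB = "rgb"
    DEPTH = "depth"
    LANGUAGE = "language"
    PROPRIOCEPTION = "proprioception"

class ActionSpace(Enum):
    JOINT_POSITION = "joint_position"
    END_EFFECTOR_POSE = "end_effector_pose"
    DISCRETE_TOKENS = "discrete_tokens"

@dataclass
class VLADatasetEntry:
    name: str
    real_world: bool            # real robot data vs. synthetic/simulated
    embodiments: list[str]      # robot platforms covered (embodiment diversity)
    modalities: set[Modality]   # modality composition
    action_space: ActionSpace   # action space formulation
    collection_cost: str        # coarse proxy for the fidelity-cost trade-off

# A hypothetical entry showing how one dataset might be indexed:
example = VLADatasetEntry(
    name="example_teleop_dataset",
    real_world=True,
    embodiments=["single_arm_manipulator"],
    modalities={Modality.RGB, Modality.LANGUAGE, Modality.PROPRIOCEPTION},
    action_space=ActionSpace.END_EFFECTOR_POSE,
    collection_cost="high",  # real teleoperation: high fidelity, high cost
)
print(example.action_space.value)  # prints "end_effector_pose"
```

A schema like this makes the fidelity-cost trade-off concrete: real-world teleoperation entries cluster at high cost and high fidelity, while synthetic entries invert that balance.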
Why It Matters
This survey reframes the VLA robotics bottleneck from models to data, guiding future research priorities for embodied AI.