Robotics data suffers from poor interoperability, not scarcity, say two ML students
Two ML students challenge the narrative that robotics has a data shortage.
Two machine learning students, after months of working with vision-language-action models (VLAs) and robotics datasets, discovered a surprising bottleneck: data interoperability. Despite the common narrative that robotics lacks sufficient training data, they found that publicly available datasets are abundant but fragmented. Each dataset comes with its own assumptions — different schemas, sensor configurations, coordinate frames, and metadata standards — requiring weeks of preprocessing just to get data into a usable format.
This led them to hypothesize that the robotics ecosystem has a data interoperability problem, not a data scarcity problem. They are now considering a large-scale experiment: gathering essentially every public robot-learning dataset they can find, normalizing them into a common schema, enriching them with metadata and quality signals, and releasing everything back to the community through a single open API. Before committing months to this effort, they are asking practitioners whether it would actually be useful, or whether deeper issues such as embodiment mismatch, data quality, or labeling are the bigger blockers. They explicitly rule out a marketplace or proprietary platform; the aim is a purely open resource.
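To make the idea of normalizing into a common schema concrete, here is a minimal Python sketch. It is an illustration only, not the students' design: the `Step` schema, the two source layouts (`dataset_a`, `dataset_b`), their field names, and the `monotonic_fraction` quality signal are all hypothetical assumptions chosen to show unit and naming mismatches being resolved.

```python
from dataclasses import dataclass
import math


@dataclass(frozen=True)
class Step:
    """One timestep in a hypothetical common schema (SI units throughout)."""
    timestamp_s: float
    joint_positions_rad: tuple
    ee_position_m: tuple  # end-effector (x, y, z)
    source: str


def from_dataset_a(rec):
    # Assumed layout A: milliseconds, degrees, millimetres.
    return Step(
        timestamp_s=rec["t_ms"] / 1000.0,
        joint_positions_rad=tuple(math.radians(d) for d in rec["joints_deg"]),
        ee_position_m=tuple(v / 1000.0 for v in rec["ee_xyz_mm"]),
        source="dataset_a",
    )


def from_dataset_b(rec):
    # Assumed layout B: already SI, but with different key names.
    return Step(
        timestamp_s=float(rec["time"]),
        joint_positions_rad=tuple(rec["q"]),
        ee_position_m=tuple(rec["ee_pos"]),
        source="dataset_b",
    )


def monotonic_fraction(steps):
    """Toy quality signal: share of consecutive pairs with increasing time."""
    if len(steps) < 2:
        return 1.0
    ok = sum(1 for a, b in zip(steps, steps[1:]) if b.timestamp_s > a.timestamp_s)
    return ok / (len(steps) - 1)


if __name__ == "__main__":
    a = from_dataset_a(
        {"t_ms": 1500, "joints_deg": [180.0, 90.0], "ee_xyz_mm": [100.0, 200.0, 300.0]}
    )
    b = from_dataset_b({"time": 2.0, "q": [0.1, 0.2], "ee_pos": [0.1, 0.2, 0.3]})
    print(a)
    print(b)
    print(monotonic_fraction([a, b]))
```

Once both sources map to the same `Step` type, downstream code no longer needs to know which dataset a record came from. A real effort would also have to reconcile coordinate frames, camera calibration, and action spaces, which this sketch deliberately leaves out.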
- Two ML students observed that robotics datasets vary widely in schema, coordinate frames, and metadata, causing weeks of preprocessing.
- They hypothesize the core issue is data interoperability, not scarcity, challenging the popular narrative.
- They are considering normalizing every public robot-learning dataset they can find into a common open schema with metadata and quality signals, and are asking the community whether this would be useful before committing to it.
Why It Matters
Addressing data interoperability could unlock cross-embodiment learning and accelerate robotics research for autonomous systems.