HumanNet: One million hours of human video for embodied AI training
1000 hours of egocentric video beats 100 hours of real robot data
Progress in embodied intelligence has been bottlenecked by the lack of large-scale, diverse human activity data. While vision and language models can be trained on internet-scale corpora, physical interaction understanding requires real-world demonstrations of tool use, object manipulation, and long-horizon behaviors. Enter HumanNet, a one-million-hour human-centric video corpus created by researchers Yufan Deng and Daquan Zhou. The dataset covers both first-person and third-person perspectives, includes fine-grained activity annotations, motion descriptions, and hand/body signals, and is designed as a scalable substrate for representation learning, motion generation, and human-to-robot transfer.
The key insight is that HumanNet treats data curation as a first-class design principle: human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment are applied to turn raw internet video into a training resource. In a controlled experiment, the team took the Qwen VLM and continued training it with 1000 hours of egocentric video from HumanNet. This outperformed continued training with 100 hours of real robot data from Magic Cobot, suggesting that human-centric video, used here at ten times the volume, can serve as a cost-effective substitute for expensive robot data collection. This opens the door to scaling embodied foundation models using widely available human activity videos rather than relying solely on robot-specific datasets.
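To make the four curation steps concrete, here is a minimal Python sketch of what such a pipeline might look like. It is not the authors' implementation: the `Clip` record, the `person_score` field, the 0.5 threshold, the 10-second segment length, and the annotation keys are all hypothetical choices made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Clip:
    """Hypothetical record for one video clip (not HumanNet's actual schema)."""
    clip_id: str
    viewpoint: str        # "ego" (first-person) or "exo" (third-person)
    duration_s: float
    person_score: float   # assumed human-presence confidence in [0, 1]
    annotations: dict = field(default_factory=dict)


def human_centric_filter(clips, min_person_score=0.5):
    """Keep only clips where a person is likely present (threshold is an assumption)."""
    return [c for c in clips if c.person_score >= min_person_score]


def temporal_structure(clips, max_segment_s=10.0):
    """Split each clip into consecutive segments of at most max_segment_s seconds."""
    segments = []
    for c in clips:
        start, idx = 0.0, 0
        while start < c.duration_s:
            length = min(max_segment_s, c.duration_s - start)
            segments.append(
                Clip(f"{c.clip_id}#{idx}", c.viewpoint, length,
                     c.person_score, dict(c.annotations))
            )
            start += length
            idx += 1
    return segments


def viewpoint_counts(clips):
    """Count segments per viewpoint so that diversity can be monitored."""
    return Counter(c.viewpoint for c in clips)


def enrich(clips):
    """Add placeholder slots for the annotation types the article lists."""
    for c in clips:
        for key in ("caption", "motion_description", "hand_body_signals"):
            c.annotations.setdefault(key, None)  # to be filled by an annotator
    return clips


def curate(raw_clips):
    """Run the four steps in order and return (segments, viewpoint counts)."""
    kept = human_centric_filter(raw_clips)
    segments = enrich(temporal_structure(kept))
    return segments, viewpoint_counts(segments)
```

In a real pipeline each step would be backed by models (person detectors, captioners, pose estimators) rather than precomputed scores and empty placeholders; the sketch only shows how the stages compose.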
- Dataset contains 1 million hours of human activity videos with first- and third-person perspectives
- Includes interaction-centric annotations: captions, motion descriptions, hand/body signals
- 1000 hours of egocentric HumanNet video outperformed 100 hours of real robot data (Magic Cobot) when used to continue training the Qwen VLM
- Designed for representation learning, motion generation, activity understanding, and human-to-robot transfer
Why It Matters
If the result holds more broadly, widely available human video could reduce reliance on expensive robot data collection and accelerate embodied AI development.