IJCAI 2026 survey: 4 ways to train robots from human videos
Human videos are cheap and robot data is scarce. Here’s how to bridge the gap.
Robot manipulation models typically require expensive, embodiment-specific demonstrations. A survey by Zhiyuan Feng and 14 co-authors, accepted at IJCAI 2026, presents a unified framework for leveraging abundant human videos instead. They categorize approaches into four classes: (i) latent action representations that encode inter-frame changes, (ii) predictive world models that forecast future frames, (iii) explicit 2D supervision extracting image-plane cues, and (iv) explicit 3D reconstruction recovering geometry or motion. Each class addresses different aspects of the embodiment and annotation gaps.
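The four-class taxonomy can be pictured as a small lookup structure. The sketch below is illustrative only and is not taken from the survey: the names `SupervisionClass`, `EXTRACTED_SIGNAL`, `Method`, and `group_by_class`, as well as the placeholder method names, are assumptions made for this example. The descriptions are paraphrased from the summary above.

```python
from dataclasses import dataclass
from enum import Enum


class SupervisionClass(Enum):
    """The four classes named in the summary (identifiers are hypothetical)."""
    LATENT_ACTION = "latent action representations"
    WORLD_MODEL = "predictive world models"
    EXPLICIT_2D = "explicit 2D supervision"
    EXPLICIT_3D = "explicit 3D reconstruction"


# What each class derives from human video, paraphrased from the summary.
EXTRACTED_SIGNAL = {
    SupervisionClass.LATENT_ACTION: "encodings of inter-frame changes",
    SupervisionClass.WORLD_MODEL: "forecasts of future frames",
    SupervisionClass.EXPLICIT_2D: "image-plane cues",
    SupervisionClass.EXPLICIT_3D: "recovered geometry or motion",
}


@dataclass(frozen=True)
class Method:
    name: str                      # placeholder label, not a real paper
    supervision: SupervisionClass  # which of the four classes it falls under


def group_by_class(methods):
    """Bucket method names by supervision class, keeping empty buckets."""
    groups = {cls: [] for cls in SupervisionClass}
    for method in methods:
        groups[method.supervision].append(method.name)
    return groups


# Usage with made-up entries: file three placeholder methods under the taxonomy.
methods = [
    Method("method_a", SupervisionClass.LATENT_ACTION),
    Method("method_b", SupervisionClass.EXPLICIT_3D),
    Method("method_c", SupervisionClass.LATENT_ACTION),
]
groups = group_by_class(methods)
for cls, names in groups.items():
    print(f"{cls.value}: {EXTRACTED_SIGNAL[cls]} -> {names}")
```

Keeping empty buckets in the result makes it easy to see which classes a given reading list does not yet cover.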
The survey also highlights three open challenges: turning unstructured human videos into training-ready episodes, grounding video-derived supervision into robot-executable actions despite embodiment and viewpoint differences, and designing evaluation protocols that better predict real-world deployment performance. The authors provide a curated list of papers and resources. This work offers a roadmap for scalable, human-data-driven robot learning, potentially reducing the cost and increasing the generality of embodied AI.
- Identifies four classes of methods for extracting action information from human videos: latent actions, world models, 2D supervision, and 3D reconstruction.
- Highlights three open challenges: turning unstructured videos into training-ready episodes, bridging embodiment and viewpoint gaps, and designing evaluation protocols that better predict real-world performance.
- Accepted at IJCAI 2026 Survey Track; includes a curated resource list for researchers.
Why It Matters
Human videos are far cheaper to collect than robot demonstrations. A roadmap for learning from them could reduce reliance on costly robot demos and make embodied AI more general.