Robotics

IJCAI 2026 survey: 4 ways to train robots from human videos

Human videos are cheap and robot data is scarce. Here’s how to bridge the gap.

Deep Dive

Robot manipulation models typically require expensive, embodiment-specific demonstrations. A survey by Zhiyuan Feng and 14 co-authors, accepted at IJCAI 2026, presents a unified framework for learning from abundant human videos instead. The authors group approaches into four classes: (i) latent action representations that encode inter-frame changes, (ii) predictive world models that forecast future frames, (iii) explicit 2D supervision that extracts image-plane cues, and (iv) explicit 3D reconstruction that recovers geometry or motion. Each class addresses different aspects of the embodiment and annotation gaps.
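
To make class (i) concrete, here is a toy sketch of the latent-action idea: the change between two consecutive frames is mapped to a discrete code that serves as a stand-in action label. It is not a method from the survey. Practical systems learn both the encoder and the codebook from data, whereas the codebook below is hand-written, and all names (frame_delta, nearest_code, latent_actions, DEMO_FRAMES, DEMO_CODEBOOK) are hypothetical.

```python
from typing import List, Sequence

Frame = List[List[float]]  # grayscale image as rows of pixel intensities


def frame_delta(prev: Frame, nxt: Frame) -> List[float]:
    """Flatten the per-pixel change between two consecutive frames."""
    return [b - a for row_a, row_b in zip(prev, nxt) for a, b in zip(row_a, row_b)]


def nearest_code(delta: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook vector closest to delta (squared L2)."""
    def dist(code: Sequence[float]) -> float:
        return sum((d - c) ** 2 for d, c in zip(delta, code))

    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def latent_actions(frames: Sequence[Frame], codebook: Sequence[Sequence[float]]) -> List[int]:
    """Assign one discrete latent action to each consecutive frame pair."""
    return [
        nearest_code(frame_delta(a, b), codebook)
        for a, b in zip(frames, frames[1:])
    ]


# Toy 1x2 "video": a bright pixel moves right, stays put, then moves left.
DEMO_FRAMES: List[Frame] = [[[1.0, 0.0]], [[0.0, 1.0]], [[0.0, 1.0]], [[1.0, 0.0]]]

# Hand-written codebook (an assumption for illustration; normally learned):
# code 0 = no change, code 1 = moved right, code 2 = moved left.
DEMO_CODEBOOK = [[0.0, 0.0], [-1.0, 1.0], [1.0, -1.0]]

if __name__ == "__main__":
    print(latent_actions(DEMO_FRAMES, DEMO_CODEBOOK))  # prints [1, 0, 2]
```

The resulting codes need no robot and no action annotations, which is what makes this style of supervision attractive for human video. Mapping such codes to robot-executable commands is a separate step.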

The survey also highlights three open challenges: turning unstructured human videos into training-ready episodes, grounding video-derived supervision into robot-executable actions despite embodiment and viewpoint differences, and designing evaluation protocols that better predict real-world deployment performance. The authors provide a curated list of papers and resources. This work offers a roadmap for scalable, human-data-driven robot learning, potentially reducing the cost and increasing the generality of embodied AI.

Key Points
  • Identifies four classes of methods for learning from human videos: latent actions, world models, 2D supervision, and 3D reconstruction.
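  • Highlights three open challenges: turning unstructured videos into training-ready episodes, grounding video-derived supervision across embodiment and viewpoint gaps, and designing evaluation protocols that better predict real-world performance.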
  • Accepted to the IJCAI 2026 Survey Track; includes a curated list of papers and resources for researchers.

Why It Matters

Training on cheap human videos could scale robot learning, reducing reliance on costly robot demonstrations and making embodied AI more general.