SUGAR framework lets humanoids learn from YouTube videos without teleoperation
No reward engineering or manual demos needed—just raw human footage.
Building humanoid robots that can perform general whole-body manipulation in the real world remains difficult: existing methods rely on tedious reward engineering, rigid motion replay, or costly teleoperation. A new paper from Peking University and collaborators introduces SUGAR (Scalable Human-Video-Driven Generalizable Humanoid Loco-Manipulation Learning Framework), which trains humanoid robots directly from abundant human videos, with no task-specific reward engineering and no reference-motion conditioning at inference. The key insight: motion priors extracted from human videos are imperfect (occlusion, contact errors, retargeting issues), so SUGAR uses a privileged physics-based refiner to correct them before distilling the result into a policy.
SUGAR proceeds in three stages. First, an automated pipeline extracts kinematic interaction priors—human-object motion trajectories and contact labels—from raw, unstructured videos. Second, a physics-based refiner uses a unified mimic reward and a progressive state pool to turn these imperfect priors into high-fidelity, physically feasible skills. Third, those refined skills are distilled into a hierarchical autonomous policy composed of a command generator and a command tracker. Evaluated on six representative loco-manipulation tasks (both in simulation and on real humanoid hardware), SUGAR substantially outperforms reference-tracking baselines. It achieves zero-shot real-world transfer with closed-loop execution, autonomous failure recovery, and stable performance under external perturbations—and crucially, its performance scales clearly with the amount of human video data.
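The three stages and the two-level policy described above can be pictured as a simple data flow. The sketch below is illustrative only and is not taken from the paper: every name in it (`InteractionPrior`, `mimic_reward`, `HierarchicalPolicy`, `run_pipeline`) is hypothetical, and the exponential tracking reward is a generic form common in motion-imitation work, used here as a stand-in for SUGAR's unified mimic reward, whose actual definition is not given in this summary.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence

Vector = Sequence[float]


@dataclass
class InteractionPrior:
    """Stage 1 output (hypothetical layout): a trajectory plus contact labels."""
    poses: List[Vector]   # one pose vector per video frame
    contacts: List[bool]  # True on frames where contact with the object is detected


def mimic_reward(sim_pose: Vector, ref_pose: Vector, scale: float = 2.0) -> float:
    """Generic exponential tracking reward, assumed for illustration.

    Equals 1.0 on a perfect match and decays toward 0 as squared error grows.
    """
    sq_err = sum((s - r) ** 2 for s, r in zip(sim_pose, ref_pose))
    return math.exp(-scale * sq_err)


@dataclass
class HierarchicalPolicy:
    """Stage 3 output: a command generator feeding a command tracker."""
    command_generator: Callable[[Vector], Vector]        # observation -> command
    command_tracker: Callable[[Vector, Vector], Vector]  # (observation, command) -> action

    def act(self, observation: Vector) -> Vector:
        # No reference motion is passed in: the command comes from the policy itself.
        command = self.command_generator(observation)
        return self.command_tracker(observation, command)


def run_pipeline(videos, extract, refine, distill):
    """Chain the three stages; each stage is supplied by the caller."""
    priors = [extract(video) for video in videos]  # stage 1: videos -> priors
    skills = [refine(prior) for prior in priors]   # stage 2: priors -> feasible skills
    return distill(skills)                         # stage 3: skills -> policy
```

The point of the split in `HierarchicalPolicy.act` is that the deployed policy consumes only observations, which matches the article's claim that no reference-motion conditioning is needed at inference.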
- No task-specific reward engineering or teleoperation required; training data comes from unstructured human videos.
- Three-stage pipeline: extract priors from videos, refine with physics-based mimic reward, distill into hierarchical policy.
- Tested on six loco-manipulation tasks with zero-shot real-world transfer, failure recovery, and scaling with data volume.
Why It Matters
Scraping YouTube to teach humanoids complex skills could dramatically cut robotics training costs and enable general-purpose manipulation.