Robotics

EgoAERO learns dexterous robot manipulation from a single video

No 3D scans needed: one egocentric RGB-D video is enough to teach a robot.

Deep Dive

EgoAERO, developed by a team led by Yichen Niu, tackles a long-standing bottleneck in robot learning: the need for costly pre-scanned object models. From a single egocentric RGB-D video of a human hand manipulating an object, the framework reconstructs contact-consistent hand-object trajectories without any prior knowledge of the object’s geometry. Three components work together to infer both object pose and hand-object interactions: asset-free object tracking, ego-motion compensation, and adaptive contact optimization. The reconstructed trajectories are then converted into robot policies through a two-stage residual learning approach, allowing a robot to replicate the dexterous task after seeing just one demonstration.
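To make the idea of ego-motion compensation concrete, here is a minimal sketch of the underlying geometry. It is not EgoAERO's implementation. It assumes the camera pose in a fixed world frame is already known for every frame, and it uses planar (2D) rigid transforms to stay short. All function and variable names are hypothetical. A head-mounted camera moves while recording, so even a stationary object appears to move in the camera frame. Composing the camera pose with the observed object pose removes that apparent motion.

```python
import math

def pose2d(theta, tx, ty):
    """Planar rigid transform as a 3x3 homogeneous matrix (rotation theta, translation tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, tx], [s, c, ty], [0.0, 0.0, 1.0]]

def matmul(A, B):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def invert(T):
    """Invert a rigid transform: rotation R^T, translation -R^T t."""
    c, s = T[0][0], T[1][0]
    tx, ty = T[0][2], T[1][2]
    return [[c, s, -(c * tx + s * ty)],
            [-s, c, -(-s * tx + c * ty)],
            [0.0, 0.0, 1.0]]

def compensate_ego_motion(T_world_cam, T_cam_obj):
    """Express an object pose observed in the camera frame in the fixed world frame."""
    return matmul(T_world_cam, T_cam_obj)

# Assumed ground truth for the demo: an object that never moves in the world.
T_world_obj_true = pose2d(0.3, 0.5, 0.2)

# Hypothetical camera trajectory: the wearer's head drifts and turns over 5 frames.
camera_poses = [pose2d(0.1 * k, 0.05 * k, 1.0) for k in range(5)]

# What a tracker would report in each frame: the object pose relative to the camera.
observed = [matmul(invert(T_wc), T_world_obj_true) for T_wc in camera_poses]

# Compensated poses, expressed in the world frame.
recovered = [compensate_ego_motion(T_wc, T_co)
             for T_wc, T_co in zip(camera_poses, observed)]

def translation_distance(A, B):
    """Euclidean distance between the translation parts of two transforms."""
    return math.hypot(A[0][2] - B[0][2], A[1][2] - B[1][2])

# Apparent object motion in the camera frame between the first and last frame.
apparent_motion = translation_distance(observed[0], observed[-1])

# Largest deviation from the true static pose after compensation.
residual_motion = max(translation_distance(T, T_world_obj_true) for T in recovered)
```

In this toy setup `apparent_motion` is clearly nonzero while `residual_motion` is at floating-point noise level, which is the property a downstream contact optimizer needs: object motion that reflects the manipulation rather than the wearer's head movement. A real pipeline would work with full 3D poses and would have to estimate the camera trajectory itself.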

To support broader research, the authors introduce EgoDex-R, a large-scale dataset containing 4.3 million RGB-D frames of dexterous manipulations captured from an egocentric viewpoint. In both simulation and real-world experiments, EgoAERO achieves single-demonstration dexterous manipulation with performance approaching that of CAD-based reconstructions on the HOI4D benchmark. This marks a significant step toward scaling robot learning from human video, dramatically reducing the data and asset preparation required for teaching robots fine-grained manipulation skills.

Key Points
  • First framework to learn dexterous manipulation from a single egocentric RGB-D video without requiring pre-scanned object assets.
  • Uses asset-free object tracking, ego-motion compensation, and adaptive contact optimization to reconstruct contact-consistent hand-object trajectories.
  • Introduces EgoDex-R, a 4.3-million-frame egocentric RGB-D dataset for dexterous policy learning; performance approaches that of CAD-based reconstructions on HOI4D.

Why It Matters

Removes the need for 3D object scans, making robot learning from human video far more scalable and practical.