Robotics

EgoPush: Learning End-to-End Egocentric Multi-Object Rearrangement for Mobile Robots

New framework achieves zero-shot sim-to-real transfer, outperforming end-to-end RL baselines in success rates.

Deep Dive

A research team from NYU and collaborating institutions has introduced EgoPush, a framework that enables mobile robots to rearrange multiple objects in cluttered environments using only egocentric vision. Unlike traditional approaches that rely on explicit global state estimation, which often fails in dynamic scenes, EgoPush operates with a single onboard camera, mimicking human spatial reasoning by encoding relative spatial relations among objects rather than absolute poses.
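
To make the core idea concrete, here is a minimal sketch of what encoding relative spatial relations, rather than absolute world-frame poses, can look like. The 2D keypoint representation and the specific feature choices are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def relative_relation_features(keypoints: np.ndarray) -> np.ndarray:
    """Encode pairwise relative relations among object keypoints.

    keypoints: (N, 2) array of object keypoints expressed in the robot's
    egocentric frame. Returns a flat vector of pairwise offsets, distances,
    and bearings -- no world frame or absolute pose is ever referenced.
    (Illustrative sketch, not the paper's representation.)
    """
    feats = []
    n = len(keypoints)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            offset = keypoints[j] - keypoints[i]               # where j sits relative to i
            dist = float(np.linalg.norm(offset))               # pairwise distance
            bearing = float(np.arctan2(offset[1], offset[0]))  # relative heading
            feats += [offset[0], offset[1], dist,
                      np.sin(bearing), np.cos(bearing)]
    return np.asarray(feats, dtype=np.float32)

# Example: three objects seen in the robot's camera frame.
objects = np.array([[0.5, 1.0], [-0.3, 2.0], [1.2, 0.4]])
print(relative_relation_features(objects).shape)  # (30,) = 6 ordered pairs * 5 features
```

Because every feature is an offset, distance, or bearing between entities the camera can actually see, the representation remains meaningful even when no global map or absolute localization is available.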

The technical innovation centers on a teacher-student distillation architecture. A privileged reinforcement learning teacher jointly learns latent states and mobile actions from sparse keypoints, then transfers this knowledge to a purely visual student policy. To bridge the supervision gap between the omniscient teacher and the partially observed student, the team restricted the teacher's observations to visually accessible cues, inducing active perception behaviors that the student can recover from its own viewpoint. For long-horizon tasks, they decomposed rearrangement into stage-level subproblems trained with temporally decayed, stage-local completion rewards.
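
A minimal PyTorch sketch of the distillation step, under stated assumptions: the teacher reads privileged sparse keypoints, the student reads only egocentric pixels, and each emits a latent state plus a mobile action. The network shapes and equal loss weighting are placeholders, not the published architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative stand-ins, not the paper's networks: the teacher consumes
# privileged sparse keypoints, the student consumes egocentric pixels.
class Teacher(nn.Module):
    def __init__(self, n_keypoints=8, latent_dim=64, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_keypoints * 2, 128), nn.ReLU(),
                                     nn.Linear(128, latent_dim))
        self.head = nn.Linear(latent_dim, action_dim)

    def forward(self, keypoints):                 # (B, n_keypoints * 2)
        latent = self.encoder(keypoints)
        return latent, self.head(latent)

class Student(nn.Module):
    def __init__(self, latent_dim=64, action_dim=2):
        super().__init__()
        self.encoder = nn.Sequential(             # tiny CNN over egocentric pixels
            nn.Conv2d(3, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, latent_dim))
        self.head = nn.Linear(latent_dim, action_dim)

    def forward(self, image):                     # (B, 3, H, W)
        latent = self.encoder(image)
        return latent, self.head(latent)

def distillation_loss(student, teacher, image, keypoints):
    """Train the student to reproduce the frozen teacher's latent state
    and action from pixels alone (equal weighting is an assumption)."""
    with torch.no_grad():                         # teacher provides targets only
        t_latent, t_action = teacher(keypoints)
    s_latent, s_action = student(image)
    return F.mse_loss(s_latent, t_latent) + F.mse_loss(s_action, t_action)
```

Freezing the teacher and regressing both its latent and its action gives the student a dense learning signal from pixels alone, which is what permits a camera-only policy at deployment.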
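
The stage-decomposed reward can be sketched just as compactly. The hypothetical `stage_reward` helper below pays a completion bonus that decays with the time spent inside the current stage; the decay rate and bonus value are assumptions for illustration, not the paper's schedule:

```python
def stage_reward(t_in_stage: int, stage_done: bool,
                 bonus: float = 1.0, decay: float = 0.99) -> float:
    """Temporally decayed, stage-local completion reward (illustrative).

    A stage might be 'push object k into its goal region'. Finishing it
    pays a bonus that shrinks the longer the agent has spent inside the
    stage, pressing the policy to close out subproblems quickly while
    leaving credit from already-completed stages untouched.
    """
    return bonus * (decay ** t_in_stage) if stage_done else 0.0
```

Because each stage's reward depends only on that stage's own clock, lingering on one subproblem never erodes credit already earned on earlier ones.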

Extensive simulation experiments show that EgoPush significantly outperforms end-to-end RL baselines in success rate, with ablation studies validating each design choice. Most impressively, the team demonstrated zero-shot sim-to-real transfer on physical mobile platforms, underscoring the framework's practical viability. This represents a major step toward robots that can operate in unstructured human environments without requiring precise maps or external sensors.

Key Points
  • Uses a single egocentric camera instead of global state estimation, which often fails in dynamic scenes
  • Teacher-student distillation achieves zero-shot sim-to-real transfer to physical mobile platforms
  • Object-centric latent space encodes relative spatial relations, outperforming end-to-end RL baselines

Why It Matters

Enables practical mobile robots for household organization tasks without expensive sensor arrays or perfect environmental maps.