SceneBot: New framework unifies humanoid free-space and contact-rich motion
Trained on 7.5 hours of reconstructed data, it carries boxes upstairs.
SceneBot, developed by researchers at Stanford University (Sirui Chen, Shibo Zhao, Zhen Wu, Jiaman Li, Guanya Shi, C. Karen Liu), addresses a key limitation in humanoid robotics: existing reinforcement learning policies excel at free-space motions but fail at contact-rich tasks such as carrying objects or traversing uneven terrain, because pure kinematic tracking cannot resolve the physical ambiguities of interaction. SceneBot overcomes this by conditioning a single policy on both reference motions and per-link contact labels, which explicitly specify the contacts each link is expected to make with the environment. To generate the necessary training data, the team developed a hindsight scene reconstruction approach that infers scene-interaction graphs from retargeted human motion, yielding 7.5 hours of reconstructed, contact-rich data.
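To make the idea of contact conditioning concrete, here is a minimal sketch of how a reference pose and per-link contact labels might be combined into one conditioning vector. It is an illustration only, not the authors' implementation: the link names, the `ReferenceFrame` type, the `policy_conditioning` function, and the binary label encoding are all assumptions, since the article does not describe the actual observation format.

```python
from dataclasses import dataclass
from typing import List, Sequence

# Hypothetical link names; the article does not list the robot's links.
LINKS = ("left_foot", "right_foot", "left_hand", "right_hand", "torso")


@dataclass
class ReferenceFrame:
    """One timestep of reference motion plus its expected contacts (assumed layout)."""
    joint_targets: List[float]    # kinematic reference, e.g. target joint angles
    contact_links: Sequence[str]  # links expected to touch the scene at this step


def contact_labels(frame: ReferenceFrame, links: Sequence[str] = LINKS) -> List[float]:
    """Return one binary label per link: 1.0 if contact is expected, else 0.0."""
    unknown = set(frame.contact_links) - set(links)
    if unknown:
        raise ValueError(f"unknown links: {sorted(unknown)}")
    return [1.0 if name in frame.contact_links else 0.0 for name in links]


def policy_conditioning(frame: ReferenceFrame, links: Sequence[str] = LINKS) -> List[float]:
    """Concatenate the kinematic reference with the per-link contact labels."""
    return list(frame.joint_targets) + contact_labels(frame, links)


# Two frames with identical kinematics but different intended contacts.
walking = ReferenceFrame([0.1, -0.2, 0.3], ["left_foot"])
carrying = ReferenceFrame([0.1, -0.2, 0.3], ["left_foot", "left_hand", "right_hand"])

print(policy_conditioning(walking))
print(policy_conditioning(carrying))
```

The two example frames share the same joint targets, so a purely kinematic tracker could not tell them apart. Once the contact labels are appended, the conditioning vectors differ, which is the kind of ambiguity the article says contact labels are meant to resolve.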
Trained on this data, SceneBot unifies free-space and contact-rich behaviors in a single policy and, according to the authors, is the first general framework to execute complex, long-horizon tasks such as carrying a box upstairs. Results show generalization to unseen motions and environments, which the authors take as evidence that contact conditioning is an effective interface for humanoid control. The paper (15 pages, 10 figures) is available on arXiv, and all code and data will be open-sourced, with demos at the project website. The work narrows the gap between locomotion and manipulation in humanoid robots.
- SceneBot unifies free-space locomotion, terrain traversal, and whole-body manipulation in a single policy.
- Conditions the policy on per-link contact labels and uses a new hindsight scene reconstruction method to generate 7.5 hours of training data.
- First general framework to execute complex tasks like carrying a box upstairs; code and data to be open-sourced.
Why It Matters
Enables humanoid robots to handle real-world contact-rich tasks, moving beyond simple walking into practical object interaction.