Research & Papers

Lifting Embodied World Models for Planning and Control

New method uses 2D waypoints to control complex human avatars, achieving 3.8x lower joint error than planning directly in joint space.

Deep Dive

Researchers from UC Berkeley (Alex N. Wang, Trevor Darrell, Pavel Izmailov, Yutong Bai, Amir Bar) have developed a novel framework called 'Lifted Embodied World Models' that simplifies planning and control for complex embodiments like human avatars. Traditional world models predict future observations from low-level joint actions, but for humanoids with dozens of joints, the resulting high-dimensional action space makes planning computationally expensive and control difficult. The team addresses this by training a lightweight policy that lifts high-level actions—specifically, 2D waypoints annotated on the current observation frame for key joints (pelvis, head, hands)—into sequences of low-level joint actions. Composing this policy with a frozen world model yields a lifted model that can predict future observations from a single high-level goal.
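To make the composition concrete, here is a minimal PyTorch sketch of how a lift policy could be chained with a frozen world model. All names, dimensions, and the rollout horizon are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the paper's actual configuration may differ.
NUM_WAYPOINTS = 4                  # e.g. pelvis, head, left hand, right hand
WAYPOINT_DIM = 2 * NUM_WAYPOINTS   # one 2D point per key joint
NUM_JOINTS = 30                    # assumed humanoid joint count
HORIZON = 16                       # low-level steps per high-level action

class WaypointLiftPolicy(nn.Module):
    """Lifts one high-level action (2D waypoints annotated on the
    current frame) into a sequence of low-level joint actions."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + WAYPOINT_DIM, hidden),
            nn.ReLU(),
            nn.Linear(hidden, HORIZON * NUM_JOINTS),
        )

    def forward(self, obs: torch.Tensor, waypoints: torch.Tensor) -> torch.Tensor:
        flat = self.net(torch.cat([obs, waypoints], dim=-1))
        return flat.view(-1, HORIZON, NUM_JOINTS)  # (batch, time, joints)

def lifted_rollout(world_model, lift_policy, obs, waypoints):
    """Compose the frozen world model with the lift policy: one
    high-level waypoint action -> HORIZON low-level steps -> predicted
    future observation. `world_model(obs, joint_action)` is assumed
    to return the next observation."""
    joint_actions = lift_policy(obs, waypoints)
    with torch.no_grad():  # inference-time rollout; the world model stays frozen
        for t in range(HORIZON):
            obs = world_model(obs, joint_actions[:, t])
    return obs
```

The key design point is that the planner only ever sees the lifted interface: a single low-dimensional waypoint vector in, a predicted observation out.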

The results are impressive: the lifted model achieves 3.8x lower mean joint error to the goal pose than searching directly in low-level joint space, while also being more compute-efficient. The framework generalizes to environments unseen by the policy, making it practical for real-world applications like robotics, animation, and VR. By reducing the action dimensionality to just a few interpretable waypoints, the approach enables easier manual specification and more efficient search-based planning with methods like the cross-entropy method (CEM). This work, published on arXiv (2604.26182), represents a significant step toward making embodied AI more controllable and scalable for complex tasks.
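To illustrate why the low-dimensional action space helps search, below is a hypothetical CEM loop over waypoint actions. Here `lifted_model` stands for the composed predictor sketched above (e.g. `functools.partial(lifted_rollout, world_model, lift_policy)`), and `cost_fn` (e.g. mean joint error to the goal pose) is assumed; neither comes from the paper's code.

```python
import torch

def cem_plan_waypoints(lifted_model, obs, goal_obs, cost_fn,
                       num_iters=5, pop=64, elite=8, dim=8):
    """Cross-entropy method over the low-dimensional waypoint space.
    `lifted_model(obs_batch, waypoint_batch)` predicts the outcome of
    a candidate waypoint action; `cost_fn` scores predictions against
    the goal. All names and defaults are illustrative."""
    mean, std = torch.zeros(dim), torch.ones(dim)
    for _ in range(num_iters):
        samples = mean + std * torch.randn(pop, dim)        # candidate waypoint actions
        preds = lifted_model(obs.expand(pop, -1), samples)  # predicted outcomes
        costs = cost_fn(preds, goal_obs)                    # shape (pop,)
        elites = samples[costs.argsort()[:elite]]           # keep the best candidates
        mean, std = elites.mean(0), elites.std(0) + 1e-4    # refit the sampling Gaussian
    return mean  # final waypoint-action estimate
```

Searching over 8 waypoint coordinates instead of HORIZON x 30 joint values is what makes this kind of sampling-based planning tractable.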

Key Points
  • A lightweight policy lifts 2D waypoints (pelvis, head, hands) into low-level joint-action sequences
  • 3.8x lower mean joint error to goal pose compared to direct joint-space search
  • Generalizes to environments unseen by the policy; CEM planning in waypoint space is more compute-efficient than direct joint-space search

Why It Matters

Enables efficient, interpretable control of humanoid agents for robotics, animation, and VR applications.