CLAW learns world models from videos with no action labels
A self-supervised framework learns how to act just by watching video, with no action labels or human annotations.
CLAW (Continuous Latent Action World Models) is a novel end-to-end self-supervised framework developed by researchers at multiple institutions (Tewodros Ayalew, Matthew Jeung, Samuel Wheeler, Xiao Zhang, Andre de la Cruz Arce, Kaylene Stocking, Michael Maire, Matthew R. Walter). The key innovation is its ability to learn a world model jointly with continuous latent action representations directly from action-free videos, using adversarial latent regularization and diffusion-based video generation. This allows the model to reason about how inferred actions cause environment transitions purely from visual observations, without any action labels or human annotations.
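To make the joint-learning idea concrete, here is a minimal toy sketch in Python. It is not CLAW's implementation: it leaves out the adversarial latent regularization and the diffusion-based video generator, and it replaces video frames with single numbers. All names (`infer_latent`, `predict_next`, `train_toy_latent_action_model`, `enc_w`, `dyn_w`) are hypothetical. It only illustrates that an encoder inferring a latent action from two consecutive observations, and a forward model consuming that latent, can be trained together on transitions whose true actions are never shown to the learner.

```python
import random


def infer_latent(enc_w, obs, next_obs):
    """Toy latent action encoder: reads a latent off two consecutive observations."""
    return enc_w * (next_obs - obs)


def predict_next(dyn_w, obs, latent):
    """Toy world model: predicts the next observation from obs and a latent action."""
    return obs + dyn_w * latent


def train_toy_latent_action_model(steps=5000, lr=0.5, seed=0):
    """Jointly fit encoder and world-model weights from action-free transitions."""
    rng = random.Random(seed)
    enc_w, dyn_w = 0.5, 0.5  # arbitrary starting weights (assumption)
    for _ in range(steps):
        obs = rng.uniform(-1.0, 1.0)
        # The true action generates the data but is never passed to the learner.
        hidden_action = rng.choice([-0.2, 0.0, 0.2])
        next_obs = obs + hidden_action

        latent = infer_latent(enc_w, obs, next_obs)
        err = predict_next(dyn_w, obs, latent) - next_obs

        # Gradients of 0.5 * err**2 with respect to each weight.
        delta = next_obs - obs
        grad_dyn = err * latent
        grad_enc = err * dyn_w * delta
        dyn_w -= lr * grad_dyn
        enc_w -= lr * grad_enc
    return enc_w, dyn_w
```

After training, the product `enc_w * dyn_w` should sit close to 1, meaning the world model reproduces the observed transition from the inferred latent alone. Transitions caused by opposite hidden actions yield latents of opposite sign, which is the toy analogue of a latent action carrying meaning.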
The framework trains a latent action model and a world model simultaneously, enabling two downstream tasks: imitation learning from observation (behavior cloning using latent actions extracted from raw video) and goal-directed planning (generating sequences of latent actions and mapping them to executable actions). The authors report that, in experiments across diverse tasks and embodiments, CLAW produces semantically meaningful latent action representations, supports effective action transfer, and outperforms existing methods in both planning and imitation. The paper, posted on arXiv under cs.RO, runs 8 pages plus 15 pages of supplementary material.
- CLAW learns both a world model and latent actions from action-free videos using adversarial latent regularization.
- Enables imitation learning via behavior cloning from raw video without explicit action labels.
- Supports goal-directed planning by generating sequences of latent actions mapped to executable actions.
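The planning idea in the last bullet can be sketched as a search over latent action sequences inside a world model. The Python below is a toy illustration under loud assumptions, not the paper's planner: `world_model_step` is a hand-written stand-in for a learned model, the search is plain random shooting (a technique chosen here for brevity), and `latent_to_executable` with its `gain` parameter is a hypothetical latent-to-action map.

```python
import random


def world_model_step(obs, latent):
    """Stand-in for a learned world model: a latent action simply shifts the state."""
    return obs + latent


def plan_latent_actions(start, goal, horizon=5, candidates=500, seed=0):
    """Random shooting: sample latent sequences, keep the one ending nearest the goal."""
    rng = random.Random(seed)
    best_seq, best_cost = None, float("inf")
    for _ in range(candidates):
        seq = [rng.uniform(-0.2, 0.2) for _ in range(horizon)]
        obs = start
        for latent in seq:
            obs = world_model_step(obs, latent)  # imagined rollout, no real actions
        cost = abs(obs - goal)
        if cost < best_cost:
            best_seq, best_cost = seq, cost
    return best_seq, best_cost


def latent_to_executable(latent, gain=10.0):
    """Hypothetical map from a latent action to a command the robot could execute."""
    return gain * latent
```

A caller would plan with `plan_latent_actions(0.0, 0.5)` and then pass each latent in the returned sequence through `latent_to_executable`. In a real system the mapping would itself be learned, typically from a small amount of action-labeled data; the linear gain here is only a placeholder.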
Why It Matters
CLAW removes the need for costly action labels, making robot learning scalable from passive video data.