HoloMotion-1 trains humanoid robots to track whole-body motions learned largely from video
A new foundation model achieves zero-shot whole-body tracking on humanoid robots without fine-tuning.
HoloMotion-1, introduced by a team led by Maiyue Chen, is a foundation model for zero-shot whole-body motion tracking on humanoid robots. Its central idea is to scale control-policy training with a large hybrid motion corpus. Most of the corpus consists of motions reconstructed from in-the-wild videos, which supply broad motion diversity, while curated motion-capture (MoCap) data and in-house data supply high-fidelity supervision and deployment-oriented coverage. Compared with MoCap-only training, this exposes the policy to a wider range of behaviors, capture conditions, and motion styles.
Video-derived data brings reconstruction noise, source-domain mismatch, and uneven motion quality. To cope with these, HoloMotion-1 pairs a large-capacity temporal modeling architecture with a sparsely activated Mixture-of-Experts (MoE) Transformer. KV-cache inference keeps the model fast enough for real-time control, and a sequence-level training strategy improves learning efficiency on long motion sequences. In experiments on multiple unseen benchmarks, the authors report robust generalization and higher tracking accuracy than prior methods. The model also transfers directly to a real humanoid robot without task-specific fine-tuning, which suggests the approach is practical for general-purpose physical intelligence.
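To make "sparsely activated" concrete, here is a minimal sketch of top-k expert routing, the common mechanism behind Mixture-of-Experts layers. It is not HoloMotion-1's implementation: the function names (`top_k_route`, `moe_layer`), the choice of k=2, and the toy scaling experts are all assumptions made for illustration, and the summary does not say which routing scheme the model uses.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected logits only, so they sum to 1.
    """
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs.

    Only k experts are evaluated; the rest are skipped entirely, which is
    what keeps compute per token low even when there are many experts.
    """
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: each one scales its input by a fixed factor.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 2.0]   # assumed router scores for one token
routes = top_k_route(logits, k=2)
y = moe_layer([1.0, -1.0], logits, experts, k=2)
```

With these toy inputs, experts 1 and 3 tie for the highest score and share the token equally, while experts 0 and 2 are never called. In a trained model the router logits would come from a learned projection of the token, which this sketch leaves out.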
- Trained on hybrid motion corpus: video-reconstructed in-the-wild motions + MoCap data for diversity and fidelity.
- Uses sparse Mixture-of-Experts Transformer with KV-cache inference for real-time humanoid control.
- Achieves zero-shot transfer to a real humanoid robot and outperforms prior methods on multiple unseen benchmarks.
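The KV-cache idea mentioned in the points above can be sketched in a few lines. This is a generic single-head illustration under stated assumptions, not the model's code: the `KVCache` class, the `attend` helper, and the toy frames (used directly as queries, keys, and values, with no learned projections) are hypothetical. The point is that each control step appends one key and one value and attends over the stored history, rather than re-encoding the whole sequence.

```python
import math


def attend(q, keys, values):
    """Scaled dot-product attention of one query over stored keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    return [sum(e * v[d] for e, v in zip(exps, values)) / total
            for d in range(dim)]


class KVCache:
    """Keeps past keys and values so each new step only adds one entry."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the newest key/value, then attend over everything so far.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)


# Toy per-step features (assumed); a real model would project them first.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Incremental inference: one cache update per control step.
cache = KVCache()
cached_out = [cache.step(f, f, f) for f in frames]

# Reference: recompute causal attention over the full prefix at each step.
full_out = [attend(frames[t], frames[:t + 1], frames[:t + 1])
            for t in range(len(frames))]
```

Both paths produce the same outputs; the cached version simply avoids rebuilding the keys and values for earlier steps. A deployed controller would presumably also bound the cache length, which this sketch does not do.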
Why It Matters
Moves humanoid robotics toward generalist control by training on diverse, unstructured video data and deploying to hardware without task-specific fine-tuning.