MetaWorld turns single-view videos into multi-agent world models
No multi-camera setup needed: a single view is enough to generate consistent multi-agent video worlds
Video world models are crucial for embodied AI and the Metaverse, but existing approaches support only a single agent seen from one perspective. Scaling to multi-agent settings faces two hurdles: the cost of collecting multi-view data, and the need to ensure that independently generated video streams share a consistent physical reality. MetaWorld, introduced by a team led by Teng Hu, addresses both challenges by operating directly on single-view footage.
MetaWorld's core innovation is Monocular World-State Unrolling (MWSU), which decomposes monocular video into the camera's ego-motion and the target subject's spatial trajectory. This naturally extracts synchronized multi-agent motion data within a shared 3D space without requiring any multi-camera setups. The framework also includes a Subject-Aware World Generator for appearance-driven simulation conditioned on per-agent identity images, enabling precise visual control over individual agents.
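As a rough illustration of the decomposition idea, the sketch below places a camera track and a subject track in one shared world frame. It is not the authors' implementation: it assumes per-frame camera-to-world poses and subject positions in camera coordinates have already been estimated by some upstream method, and the function name `unroll_world_state` and its inputs are hypothetical.

```python
import numpy as np

def unroll_world_state(cam_to_world, subject_in_cam):
    """Express camera and subject motion in one shared world frame.

    cam_to_world:   (T, 4, 4) assumed camera-to-world poses (ego-motion).
    subject_in_cam: (T, 3) assumed subject positions in camera coordinates.
    Returns (camera_positions, subject_positions), each of shape (T, 3).
    """
    cam_to_world = np.asarray(cam_to_world, dtype=float)
    subject_in_cam = np.asarray(subject_in_cam, dtype=float)
    rotation = cam_to_world[:, :3, :3]      # per-frame camera orientation
    translation = cam_to_world[:, :3, 3]    # per-frame camera position
    # Rotate each camera-frame point into the world frame, then translate.
    subject_world = np.einsum("tij,tj->ti", rotation, subject_in_cam) + translation
    return translation, subject_world

# Toy example: the camera slides along x while the subject stands still.
T = 4
poses = np.tile(np.eye(4), (T, 1, 1))
poses[:, 0, 3] = np.arange(T)
subject_cam = np.stack(
    [5.0 - np.arange(T), np.zeros(T), np.full(T, 2.0)], axis=1
)
camera_track, subject_track = unroll_world_state(poses, subject_cam)
```

In this toy case the subject appears to drift in the camera frame, yet its recovered world trajectory is constant. Both tracks come out time-synchronized and in the same coordinates, which is the property the paragraph above describes.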
To keep both agents' egocentric views grounded in the same physical reality, MetaWorld employs World-State Alignment (WSA), a per-frame inter-branch cross-attention mechanism inserted at every transformer layer of the video diffusion transformer (DiT). WSA synchronizes the denoising process across views, enforcing both static geometric consistency and dynamic motion consistency. Experiments demonstrate superior cross-view consistency and identity fidelity compared to prior methods.
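A minimal sketch of what per-frame inter-branch cross-attention could look like follows. The summary does not give the layer's details, so everything here is an assumption: a single attention head, a residual connection, projection weights shared by both directions, and the hypothetical names `per_frame_cross_attention` and `align_branches`.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def per_frame_cross_attention(x_query, x_other, w_q, w_k, w_v):
    """Tokens of one branch attend to the other branch, frame by frame.

    x_query, x_other: (T, N, D) token features of the two branches.
    w_q, w_k, w_v:    (D, D) projection matrices (assumed, single head).
    """
    q = x_query @ w_q
    k = x_other @ w_k
    v = x_other @ w_v
    # Batched over the frame axis, so frame t only sees frame t of the other branch.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (T, N, N)
    return x_query + softmax(scores, axis=-1) @ v             # residual update

def align_branches(x_a, x_b, w_q, w_k, w_v):
    # Symmetric exchange: each branch is updated with information from the other.
    out_a = per_frame_cross_attention(x_a, x_b, w_q, w_k, w_v)
    out_b = per_frame_cross_attention(x_b, x_a, w_q, w_k, w_v)
    return out_a, out_b

rng = np.random.default_rng(0)
T, N, D = 3, 4, 8
x_a = rng.normal(size=(T, N, D))
x_b = rng.normal(size=(T, N, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out_a, out_b = align_branches(x_a, x_b, w_q, w_k, w_v)
```

Because attention is batched over the frame axis, a change to frame t in one branch only influences frame t in the other. In a full model, a step like this would presumably sit inside each transformer layer and run at every denoising step.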
- Monocular World-State Unrolling (MWSU) extracts camera ego-motion and subject trajectories from single-view video, eliminating the need for multi-camera data
- World-State Alignment (WSA) uses inter-branch cross-attention across all transformer layers to synchronize denoising, ensuring consistent physical reality between views
- Achieves superior cross-view consistency and identity fidelity, establishing a scalable paradigm for multi-agent video world modeling
Why It Matters
Enables scalable, cost-effective multi-agent world models built from accessible single-view video data, for embodied AI and the Metaverse