NVIDIA's Cosmos 3 omnimodal models generate video, audio, and actions from any input
World models that combine text, images, video, and actions for Physical AI at 16B and 64B scales.
NVIDIA has released Cosmos 3, a suite of omnimodal world models now available on Hugging Face. The collection includes two model sizes: Nano at 16 billion parameters and Super at 64 billion. Each can generate video, images, audio, and action commands from mixed inputs such as text, images, video clips, and action trajectories. Because Cosmos 3 accepts any combination of these input modalities and produces coherent dynamic content across multiple formats, a single model can serve as a unified world model for both perception and action prediction.
The primary application of Cosmos 3 is in Physical AI, where it can drive world understanding, simulation, and embodied policy learning. By generating realistic, dynamic worlds from sparse or partial inputs, the models could power next-generation robotics training, autonomous vehicle simulation, and interactive 3D environments. Early community discussion on X (formerly Twitter) hints at potential use in reinforcement learning and agent-based tasks. With these models, NVIDIA continues to push toward a future where AI not only sees and hears but also moves and interacts in simulated physical spaces.
- Available in two sizes: Nano (16B parameters) and Super (64B parameters) via Hugging Face.
- Accepts text, image, video, and action trajectory inputs to generate dynamic video, image, audio, and action outputs.
- Designed for Physical AI applications: world understanding, simulation, and embodied policy learning.
Why It Matters
Cosmos 3 could power next-generation robotics training and simulation by generating coherent multimodal worlds from any combination of inputs.