NVIDIA Cosmos 3: Open omni-model unifies physical AI reasoning and action
A single MoT model handles world generation, reasoning, and action for robotics and autonomous vehicles
NVIDIA Cosmos 3 consolidates what earlier Cosmos releases split across separate models for prediction, reasoning, and policy into a single omni-model built on a Mixture-of-Transformers (MoT) architecture. It processes multiple modalities (text, image, video, audio, and action) through dedicated encoders into a shared representation space. The model splits its input into an autoregressive subsequence for reasoning and a diffusion subsequence for generation. Joint attention across the two subsequences lets the same network switch between tasks such as visual language modeling, video generation, and robot policy without architectural changes.
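To make the idea of joint attention over two subsequences concrete, here is a minimal sketch of one possible attention mask. It is not taken from the Cosmos 3 code. It assumes a layout common to mixed autoregressive/diffusion models: reasoning tokens come first and attend causally, while generation tokens see the full reasoning prefix and attend to each other bidirectionally. The function name `joint_attention_mask` and the token layout are hypothetical.

```python
def joint_attention_mask(n_ar, n_diff):
    """Build a boolean mask where mask[q][k] is True if query q may attend to key k.

    ASSUMPTION (not confirmed by the source): autoregressive (AR) tokens occupy
    positions 0..n_ar-1 and diffusion tokens occupy the remaining positions.
    """
    n = n_ar + n_diff
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_ar:
                # AR query: causal attention, so only itself and earlier AR tokens.
                mask[q][k] = k <= q
            else:
                # Diffusion query: sees the whole AR prefix and every diffusion token.
                mask[q][k] = True
    return mask


if __name__ == "__main__":
    for row in joint_attention_mask(n_ar=3, n_diff=2):
        print("".join("1" if allowed else "0" for allowed in row))
    # Prints:
    # 10000
    # 11000
    # 11100
    # 11111
    # 11111
```

Under this assumed layout, a pure reasoning task uses only the causal upper-left block, while a generation task adds diffusion rows that condition on the reasoning tokens, which is one way a single network could serve both modes.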
Two model sizes are available: Cosmos 3 Nano (8B parameters for the reasoner and 8B for the generator), optimized for workstation GPUs like the RTX PRO 6000, and Cosmos 3 Super (32B+32B), designed for large-scale synthetic data generation on Hopper/Blackwell GPUs. Both are open-sourced on Hugging Face with Diffusers integration, and post-training scripts are available on GitHub. The model can generate physically plausible videos from text, images, or actions, reason about motion and causality, and predict future video and action sequences. These capabilities matter for robotics, autonomous driving simulators, and warehouse safety training data.
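As a rough guide to why the two sizes target different hardware, the weight footprint can be estimated from the nominal parameter counts above. This is back-of-the-envelope arithmetic, not a published requirement. It assumes both the reasoner and the generator are resident at once, uses round parameter counts, and ignores encoders, activations, and caches. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

# Nominal (reasoner, generator) parameter counts as stated in the text.
MODELS = {
    "Cosmos 3 Nano": (8_000_000_000, 8_000_000_000),
    "Cosmos 3 Super": (32_000_000_000, 32_000_000_000),
}


def weight_memory_gb(reasoner_params, generator_params, dtype="bf16"):
    """Estimate weight storage in GB (1 GB = 1e9 bytes) for both components.

    ASSUMPTION: both components are loaded together; real usage is higher
    because activations, caches, and encoders are not counted here.
    """
    total_params = reasoner_params + generator_params
    return total_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    for name, (reasoner, generator) in MODELS.items():
        print(f"{name}: ~{weight_memory_gb(reasoner, generator):.0f} GB of weights in bf16")
    # Prints:
    # Cosmos 3 Nano: ~32 GB of weights in bf16
    # Cosmos 3 Super: ~128 GB of weights in bf16
```

Halving the precision halves the estimate, so the same arithmetic gives about 16 GB and 64 GB at one byte per parameter.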
- Single unified omni-model replaces separate models for world prediction, reasoning, and action policy
- Mixture-of-Transformers (MoT) architecture with joint attention between autoregressive and diffusion token streams
- Two sizes: Cosmos 3 Nano (8B+8B) for workstation GPUs and Cosmos 3 Super (32B+32B) for large-scale synthetic data generation
Why It Matters
NVIDIA Cosmos 3 simplifies physical AI development by offering one open model for simulation, reasoning, and action generation.