
Open Source

NVIDIA Cosmos 3: Open omni-model unifies physical AI reasoning and action

Hugging Face Blog June 01, 2026

⚡ A single MoT model handles world generation, reasoning, and action for robotics and AVs

Deep Dive

NVIDIA Cosmos 3 consolidates physical world modeling into one model. Previous Cosmos releases required separate models for prediction, reasoning, and policy; Cosmos 3 is a single omni-model built on a Mixture-of-Transformers (MoT) architecture. It processes multiple modalities (text, image, video, audio, and action) through dedicated encoders into a shared representation space. The model splits its input into an autoregressive subsequence for reasoning and a diffusion subsequence for generation. Joint attention across the two subsequences lets the same model switch between tasks such as visual language modeling, video generation, and robot policy without architectural changes.
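To make the joint-attention idea concrete, here is a minimal NumPy sketch of one MoT-style attention step. It is not the Cosmos 3 implementation. It follows the general Mixture-of-Transformers pattern: each stream has its own projection weights, while attention runs over the concatenated sequence. The function names, the toy dimensions, and especially the mask (causal for autoregressive tokens, bidirectional among diffusion tokens) are assumptions chosen for illustration.

```python
import numpy as np

AR, DIFF = 0, 1  # hypothetical stream labels: autoregressive vs. diffusion


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def build_mask(stream_ids):
    """Assumed mask: causal everywhere, plus full visibility among diffusion tokens."""
    T = len(stream_ids)
    causal = np.tril(np.ones((T, T), dtype=bool))
    is_diff = stream_ids == DIFF
    bidirectional = is_diff[:, None] & is_diff[None, :]
    return causal | bidirectional


def joint_attention(tokens, stream_ids, params, mask):
    """One attention step with per-stream weights and shared (joint) attention.

    tokens: (T, d) array; stream_ids: (T,) array of AR/DIFF labels;
    params: list of dicts with Wq, Wk, Wv, Wo, one dict per stream.
    """
    T, d = tokens.shape
    q, k, v = (np.empty((T, d)) for _ in range(3))
    # Each stream projects its own tokens with its own weights.
    for s, p in enumerate(params):
        sel = stream_ids == s
        q[sel] = tokens[sel] @ p["Wq"]
        k[sel] = tokens[sel] @ p["Wk"]
        v[sel] = tokens[sel] @ p["Wv"]
    # Attention itself is computed once over the whole concatenated sequence.
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    attn = softmax(scores)
    mixed = attn @ v
    # Output projection is again stream-specific.
    out = np.empty((T, d))
    for s, p in enumerate(params):
        sel = stream_ids == s
        out[sel] = mixed[sel] @ p["Wo"]
    return out, attn


rng = np.random.default_rng(0)
d = 8
stream_ids = np.array([AR, AR, AR, DIFF, DIFF, DIFF, DIFF])
tokens = rng.normal(size=(len(stream_ids), d))
params = [
    {name: rng.normal(size=(d, d)) / np.sqrt(d) for name in ("Wq", "Wk", "Wv", "Wo")}
    for _ in range(2)
]
mask = build_mask(stream_ids)
out, attn = joint_attention(tokens, stream_ids, params, mask)
print(out.shape)
```

In this toy layout the diffusion tokens can read every reasoning token, while the reasoning tokens never see the diffusion tokens that follow them. Under these assumptions, one set of attention scores serves both streams, which is what lets a single model carry reasoning and generation together.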

Two model sizes are available. Cosmos 3 Nano (8B parameters for the reasoner and 8B for the generator) is optimized for workstation GPUs such as the RTX PRO 6000. Cosmos 3 Super (32B+32B) is designed for large-scale synthetic data generation on Hopper/Blackwell GPUs. Both are open-sourced on Hugging Face with Diffusers integration, and post-training scripts are available on GitHub. The model can generate physically plausible videos from text, images, or actions, reason about motion and causality, and predict future video/action sequences. These capabilities target robotics, autonomous driving simulators, and warehouse safety training data.
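As a rough sizing aid, the parameter counts above translate into weight memory as follows. This is a back-of-the-envelope sketch and not a published requirement: it assumes 2 bytes per parameter (bf16 or fp16), counts only the reasoner and generator weights, and ignores the encoders, activations, and any caches. The helper name `weight_footprint_gb` is hypothetical.

```python
BYTES_PER_PARAM = 2  # assumption: weights stored in bf16/fp16

# (reasoner parameters, generator parameters), as stated in the article
MODELS = {
    "Cosmos 3 Nano": (8_000_000_000, 8_000_000_000),
    "Cosmos 3 Super": (32_000_000_000, 32_000_000_000),
}


def weight_footprint_gb(reasoner_params, generator_params,
                        bytes_per_param=BYTES_PER_PARAM):
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return (reasoner_params + generator_params) * bytes_per_param / 1e9


for name, (reasoner, generator) in MODELS.items():
    total_b = (reasoner + generator) / 1e9
    print(f"{name}: {total_b:.0f}B params, "
          f"~{weight_footprint_gb(reasoner, generator):.0f} GB of weights")
```

Under these assumptions Nano needs roughly 32 GB for weights and Super roughly 128 GB, which is consistent with the article placing Nano on a single workstation GPU and Super on data-center hardware.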

Key Points

Single unified omni-model replaces separate models for world prediction, reasoning, and action policy
Mixture-of-Transformers (MoT) architecture with joint attention between autoregressive and diffusion token streams
Two sizes: Cosmos 3 Nano (8B+8B) for workstations and Cosmos 3 Super (32B+32B) for large-scale research

Why It Matters

NVIDIA Cosmos 3 simplifies physical AI development with one open model for simulation, reasoning, and action generation.
