Open Source

NVIDIA's Cosmos 3 Omnimodal models generate video, audio, and actions from any input

r/LocalLLaMA June 02, 2026

⚡ World models that combine text, images, video, and actions for Physical AI at 16B and 64B scales.

Deep Dive

NVIDIA has released Cosmos 3, a suite of omnimodal world models now available on Hugging Face. The collection includes two model sizes: Nano at 16 billion parameters and Super at 64 billion. Each can generate high-quality video, images, audio, and action commands from mixed inputs such as text, images, video clips, and action trajectories. Because Cosmos 3 accepts any combination of these input modalities and produces coherent dynamic content across several output formats, it acts as a single world model for both perception and action prediction rather than a set of separate single-purpose models.

The primary application of Cosmos 3 is Physical AI, where it can support world understanding, simulation, and embodied policy learning. By generating realistic, dynamic worlds from sparse or partial inputs, the models could be used for robotics training, autonomous vehicle simulation, and interactive 3D environments. Early community discussion on X (formerly Twitter) points to potential use in reinforcement learning and agent-based tasks. With these models, NVIDIA continues to push toward AI that not only sees and hears but also moves and interacts in simulated physical spaces.

Key Points

Available in two sizes: Nano (16B parameters) and Super (64B parameters) via Hugging Face.
Accepts text, image, video, and action trajectory inputs to generate dynamic video, image, audio, and action outputs.
Designed for Physical AI applications: world understanding, simulation, and embodied policy learning.
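
For readers sizing hardware, the two parameter counts above translate into a rough weights-only memory footprint. The sketch below is back-of-envelope arithmetic, not a published requirement: the precisions listed (bf16, int8, int4) are assumptions about common storage formats rather than formats confirmed for Cosmos 3, and the estimate ignores activations, caches, and any auxiliary components the release may include.

```python
# Rough weights-only memory estimate; ignores activations, caches, and runtime overhead.
PARAMS = {"Nano": 16e9, "Super": 64e9}  # parameter counts stated in the article

# Bytes per parameter for some common storage precisions.
# These are assumptions for illustration, not formats confirmed for Cosmos 3.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the weights-only footprint in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


def estimate_all() -> dict:
    """Map (model name, precision) to the estimated weights-only size in GB."""
    return {
        (name, precision): weight_gb(count, width)
        for name, count in PARAMS.items()
        for precision, width in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for (name, precision), gb in estimate_all().items():
        print(f"{name:5s} {precision:5s} ~{gb:.0f} GB")
```

Under these assumptions, the weights alone scale linearly with parameter count, so Super needs four times the memory of Nano at any given precision. Actual requirements will be higher once inference overhead is included.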

Why It Matters

Powers next-gen robotics and simulation by generating coherent multimodal worlds from any input combination.
