Open Source

Introducing NVIDIA Nemotron 3 Nano Omni: Long-Context Multimodal Intelligence for Documents, Audio and Video Agents

NVIDIA's new omni-modal model beats Qwen3-Omni in audio, video, and document reasoning.

Deep Dive

NVIDIA has unveiled Nemotron 3 Nano Omni, a new omni-modal understanding model built for real-world document analysis, multi-image reasoning, automatic speech recognition, long audio-video understanding, agentic computer use, and general reasoning. It extends the Nemotron line from a vision-language system to a broader text+image+video+audio model. The architecture combines the Nemotron 3 hybrid Mamba-Transformer Mixture-of-Experts backbone with a C-RADIOv4-H vision encoder and a Parakeet-TDT-0.6B-v2 audio encoder, and is designed to preserve fine visual detail and scale to very long multimodal contexts for dense images, documents, and videos.

Nemotron 3 Nano Omni delivers best-in-class accuracy on complex document intelligence leaderboards such as MMLongBench-Doc and OCRBenchV2, while also leading on video and audio leaderboards like WorldSense and DailyOmni. It achieves top accuracy on VoiceBench for audio understanding and ranks as the most cost-efficient open video understanding model on MediaPerf. The model delivers up to 9x higher throughput and 2.9x the single-stream reasoning speed on multimodal use cases compared with alternatives. It also leads the open-weights Qwen3-Omni in many domains, including document understanding (MMLongBench-Doc: 57.5 vs 49.5), video understanding (Video-MME: 72.2 vs 70.5), and audio understanding (VoiceBench: 89.4 vs 88.8). The training recipe uses staged multimodal alignment and context extension, followed by preference optimization and multimodal reinforcement learning.
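To put the head-to-head numbers in perspective, the short sketch below computes the absolute and relative gaps between the two models from the three score pairs quoted above. The `scores` dictionary and the `margins` helper are illustrative names introduced here, not part of any NVIDIA tooling, and the sketch assumes all three benchmarks are higher-is-better scores.

```python
# Scores as quoted in the article: (Nemotron 3 Nano Omni, Qwen3-Omni).
# Assumption: higher is better on all three benchmarks.
scores = {
    "MMLongBench-Doc": (57.5, 49.5),
    "Video-MME": (72.2, 70.5),
    "VoiceBench": (89.4, 88.8),
}


def margins(pairs):
    """Return {benchmark: (gap in points, gap relative to Qwen3-Omni in percent)}."""
    out = {}
    for name, (nemotron, qwen) in pairs.items():
        gap = nemotron - qwen
        out[name] = (round(gap, 1), round(100 * gap / qwen, 1))
    return out


for name, (abs_gap, rel_gap) in margins(scores).items():
    print(f"{name}: +{abs_gap} points ({rel_gap}% relative)")
```

On these figures the lead is widest on document understanding (8.0 points) and narrowest on audio (0.6 points), so the three results are not equally decisive.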

Key Points
  • Nemotron 3 Nano Omni achieves 57.5 on MMLongBench-Doc (vs. 49.5 for Qwen3-Omni) and 65.8 on OCRBenchV2-En.
  • Delivers up to 9.2x higher system efficiency for video use cases and 7.4x for multi-document use cases at fixed interactivity thresholds.
  • Supports 100+ page documents, automatic speech recognition (ASR) with a score of 5.95 on the HF Open ASR leaderboard, and long audio-video understanding with 74.1 on DailyOmni.

Why It Matters

NVIDIA's omni model sets a new efficiency standard for multimodal AI, enabling faster, cheaper enterprise document and video analysis.