NVIDIA's Nemotron 3 Nano Omni lands on SageMaker for unified multimodal AI

AWS Machine Learning Blog April 29, 2026

⚡One model now processes video, audio, images, and text in a single pass...

Deep Dive

NVIDIA's Nemotron 3 Nano Omni, a multimodal large language model that unifies video, audio, image, and text understanding in a single architecture, is now available on Amazon SageMaker JumpStart. It has 30 billion total parameters, of which 3 billion are active (30B A3B), and uses a Mamba2 Transformer Hybrid Mixture of Experts (MoE) design that combines the Nemotron 3 Nano LLM backbone, the CRADIO v4-H vision encoder, and the Parakeet speech encoder. The model supports a 131K token context length, chain-of-thought reasoning, tool calling, JSON output, and word-level timestamps for transcription. It accepts video (up to 2 minutes, 256 frames), audio (up to 1 hour), images (JPEG, PNG), and text as input, and it produces text as output. It uses FP8 precision for efficiency.
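To make the stated input limits concrete, here is a minimal Python sketch of a client-side pre-check that could run before a request is sent. The numeric limits come from the paragraph above; everything else (the `MediaInput` class, the `check_input` and `active_fraction` helpers, and the idea of treating the limits as hard caps) is a hypothetical illustration, not part of any NVIDIA or SageMaker API.

```python
from dataclasses import dataclass
from typing import Optional

# Limits as stated in the article; treating them as hard caps is an assumption.
MAX_VIDEO_SECONDS = 2 * 60      # video: up to 2 minutes
MAX_VIDEO_FRAMES = 256          # video: up to 256 frames
MAX_AUDIO_SECONDS = 60 * 60     # audio: up to 1 hour
ALLOWED_IMAGE_FORMATS = {"jpeg", "png"}

TOTAL_PARAMS_B = 30             # total parameters, in billions
ACTIVE_PARAMS_B = 3             # active parameters, in billions


@dataclass
class MediaInput:
    """Hypothetical description of one input item (not a real SDK type)."""
    kind: str                           # "video", "audio", "image", or "text"
    duration_seconds: float = 0.0       # used for video and audio
    frame_count: int = 0                # used for video
    image_format: Optional[str] = None  # used for images, e.g. "png"


def check_input(item: MediaInput) -> list[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems: list[str] = []
    if item.kind == "video":
        if item.duration_seconds > MAX_VIDEO_SECONDS:
            problems.append("video longer than 2 minutes")
        if item.frame_count > MAX_VIDEO_FRAMES:
            problems.append("video has more than 256 frames")
    elif item.kind == "audio":
        if item.duration_seconds > MAX_AUDIO_SECONDS:
            problems.append("audio longer than 1 hour")
    elif item.kind == "image":
        fmt = (item.image_format or "").lower()
        if fmt == "jpg":  # treat the common alias as JPEG
            fmt = "jpeg"
        if fmt not in ALLOWED_IMAGE_FORMATS:
            problems.append("image format must be JPEG or PNG")
    elif item.kind != "text":
        problems.append(f"unsupported input kind: {item.kind}")
    return problems


def active_fraction() -> float:
    """Share of parameters that are active, from the 30B A3B figures."""
    return ACTIVE_PARAMS_B / TOTAL_PARAMS_B


if __name__ == "__main__":
    clip = MediaInput(kind="video", duration_seconds=150, frame_count=200)
    print(check_input(clip))
    print(f"active share of parameters: {active_fraction():.0%}")
```

A check like this only catches obviously oversized inputs early. The 131K token context limit is not covered, because counting tokens would require the model's tokenizer.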

This release addresses a key pain point in enterprise agent workflows, which traditionally stitch together separate models for vision, speech, and language, an approach that increases latency, orchestration complexity, and cost. Nemotron 3 Nano Omni functions as a multimodal perception sub-agent, providing eyes and ears to agent systems in a single inference pass. Use cases include computer-use agents for GUI navigation (e.g., incident management, browser automation), document intelligence for contracts and financial documents, and audio/video understanding for meeting analysis, customer service review, and package delivery verification via OCR. The model is licensed for commercial use under the NVIDIA Open Model Agreement.

Key Points

30B total parameters with only 3B active using Mamba2 Transformer Hybrid MoE architecture for efficiency
Supports 131K token context length and processes video (2 min), audio (1 hour), images, and text in one pass
Enables computer-use agents, document intelligence, and audio/video understanding without splitting across models

Why It Matters

Replaces fragmented multimodal pipelines, which chain separate vision, speech, and language models, with a single model, reducing latency, orchestration complexity, and cost for enterprise AI agents.
