ERNIE's NAVA: 6.3B-parameter model generates synchronized audio-video from text
2x-5x fewer parameters than baselines, yet sets new SOTA on Verse-Bench
ERNIE Research has introduced NAVA, a 6.3-billion-parameter joint audio-video generation model that produces synchronized video and audio from a single text prompt. Unlike previous approaches, which either align audio and video post hoc or use fully unified tri-modal stacks, NAVA uses an Align-then-Fuse MMDiT (Multi-Modal Diffusion Transformer) architecture. The model first establishes audio-video correspondence in a dedicated alignment space, then fuses in context (text, speaker embeddings, and timbre references) via cross-attention. This design lets NAVA generate multi-speaker speech with reference-timbre control and continue a video-and-audio sequence from an input image while maintaining temporal coherence.
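The two-stage idea can be illustrated with a toy sketch. This is not NAVA's implementation: the source does not say how the alignment space is built, so joint self-attention over concatenated video and audio tokens is used here as a stand-in assumption. The function names (`attention`, `align_then_fuse`), the single-head attention without learned projections, and all tensor shapes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    # Single-head scaled dot-product attention, no learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v


def align_then_fuse(video, audio, context):
    """Toy align-then-fuse block (hypothetical, for illustration only).

    video:   (n_video, d) video tokens
    audio:   (n_audio, d) audio tokens
    context: (n_ctx, d) conditioning tokens, e.g. text, speaker,
             and timbre embeddings stacked along the first axis
    """
    n_video = video.shape[0]
    # Stage 1 (align): video and audio tokens attend to each other.
    # ASSUMPTION: joint self-attention stands in for the alignment space.
    av = np.concatenate([video, audio], axis=0)
    av = av + attention(av, av, av)
    # Stage 2 (fuse): aligned tokens query the context via cross-attention.
    av = av + attention(av, context, context)
    return av[:n_video], av[n_video:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    v, a = align_then_fuse(
        rng.normal(size=(8, 16)),   # 8 video tokens
        rng.normal(size=(12, 16)),  # 12 audio tokens
        rng.normal(size=(5, 16)),   # 5 context tokens
    )
    print(v.shape, a.shape)
```

The point of the ordering is that the context never sees unaligned audio or video tokens: conditioning is injected only after the two streams have exchanged information.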
On Verse-Bench, NAVA sets new state-of-the-art results across multiple metrics: Sync-C and Sync-D (lip-sync confidence and distance, respectively), overall video quality, and audio word error rate (WER). It achieves these results with 2× to 5× fewer parameters than existing open-source baselines. The model is available on Hugging Face and the code is open-sourced on GitHub, so researchers and developers can experiment with joint audio-video generation. This efficiency could benefit applications such as film dubbing, virtual avatars, and real-time multimedia content creation, where synchronized audio and video are critical.
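For readers unfamiliar with WER, a minimal sketch of the standard definition follows: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. How Verse-Bench obtains the hypothesis transcript is not stated in the source. A common setup is to transcribe the generated audio with a speech recognizer, but that is an assumption here, and the function name `word_error_rate` is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Lower is better, and the value can exceed 1.0 when the hypothesis contains many inserted words.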
- 6.3B-parameter Align-then-Fuse MMDiT architecture that separates alignment from fusion for better audio-video synchronization
- New SOTA on Verse-Bench: Sync-C, Sync-D, video quality, and audio WER, despite using 2x-5x fewer parameters than baselines
- Supports multi-speaker speech with reference-timbre control and image-conditioned video continuation from a single prompt
Why It Matters
NAVA enables efficient, synchronized audio-video generation from text, which lowers computational cost and opens the door to real-time dubbing and virtual content.