Dynin-Omni: Omnimodal Unified Large Diffusion Language Model
A new masked-diffusion model scores 87.6 on GSM8K and posts a 2.1 word error rate on LibriSpeech speech recognition, outperforming open-source unified rivals.
A research team led by Jaeik Kim has unveiled Dynin-Omni, a groundbreaking AI model that unifies generation and understanding of text, images, and speech, along with video understanding, within a single, cohesive architecture. Unlike current approaches that either serialize different data types (autoregressive) or stitch together separate expert models (compositional), Dynin-Omni formulates everything as a masked diffusion process over a shared token space. This allows the model to iteratively refine outputs using bidirectional context from any combination of modalities, enabling true any-to-any modeling.
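To make the decoding idea concrete, here is a minimal sketch of how a masked-diffusion decoder might iteratively unmask a target span in a shared token space, committing its most confident predictions at each step while conditioning on bidirectional context. The model interface, mask-token id, and unmasking schedule are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of masked-diffusion decoding over a shared token space.
# `model`, MASK_ID, and the schedule are illustrative; the paper's API is not shown here.
import torch

MASK_ID = 0  # placeholder id for the [MASK] token (assumption)

@torch.no_grad()
def masked_diffusion_decode(model, prompt_tokens, gen_len=64, steps=8):
    """Start from a fully masked target span and fill it in over `steps`
    refinement passes, keeping the most confident predictions each time.
    Every pass sees the whole sequence, i.e. bidirectional context."""
    device = prompt_tokens.device
    target = torch.full((gen_len,), MASK_ID, dtype=torch.long, device=device)
    seq = torch.cat([prompt_tokens, target])            # prompt + masked span
    gen_slice = slice(len(prompt_tokens), len(seq))

    for step in range(steps):
        logits = model(seq.unsqueeze(0)).squeeze(0)      # (seq_len, vocab_size), assumed signature
        probs = logits[gen_slice].softmax(-1)
        conf, pred = probs.max(-1)                       # per-position confidence and argmax token

        still_masked = seq[gen_slice] == MASK_ID
        # Unmask a growing fraction of positions, highest confidence first.
        n_keep = int(gen_len * (step + 1) / steps) - int((~still_masked).sum())
        if n_keep > 0:
            conf = conf.masked_fill(~still_masked, -1.0)  # never re-select committed positions
            idx = conf.topk(n_keep).indices
            seq[gen_slice][idx] = pred[idx]
    return seq[gen_slice]
```

Because the masked positions can sit in any modality's region of the shared vocabulary, the same loop serves text, image, and speech token generation.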
Dynin-Omni was trained using a multi-stage strategy with model-merging for modality expansion. The team rigorously evaluated it across 19 multimodal benchmarks. Key results include a score of 87.6 on the GSM8K math reasoning test, 1733.6 on the MME-P multimodal evaluation, 61.4 on VideoMME, and a word error rate (WER) of just 2.1 on the LibriSpeech test-clean dataset for speech recognition. These scores consistently beat other open-source unified models and remain competitive with specialized, single-modality expert systems.
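The model-merging step can be pictured as a weight-space combination of a base model with modality-specific experts. The snippet below is a minimal, hypothetical sketch of such parameter averaging; the paper's actual merging recipe, coefficients, and parameter groups are not specified here.

```python
# Illustrative sketch of model merging for modality expansion via weighted
# parameter averaging. The weights and the choice of experts are assumptions.
import torch

def merge_state_dicts(base_sd, expert_sds, weights):
    """Combine a base model's parameters with modality experts'.
    `weights` covers [base, expert_1, ..., expert_n] and should sum to 1;
    keys missing from an expert fall back to the base parameter."""
    merged = {}
    for name, base_param in base_sd.items():
        acc = weights[0] * base_param.float()
        for w, sd in zip(weights[1:], expert_sds):
            acc = acc + w * sd.get(name, base_param).float()
        merged[name] = acc.to(base_param.dtype)
    return merged

# Example (hypothetical models and weights):
# merged = merge_state_dicts(text_model.state_dict(),
#                            [speech_model.state_dict(), vision_model.state_dict()],
#                            weights=[0.5, 0.25, 0.25])
```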
The research demonstrates the viability of masked diffusion as a unified paradigm for omnimodal AI. This architecture provides a flexible foundation for building real-time systems that can seamlessly retrieve and generate across modalities, and is a significant step toward creating more capable embodied multimodal agents that can interact with the world through multiple senses.
- First masked-diffusion model unifying text, image, speech, and video in one architecture, using a shared token space for iterative refinement.
- Outperforms open-source unified models on 19 benchmarks, scoring 87.6 on GSM8K math and achieving 2.1 WER on LibriSpeech speech recognition.
- Proposes a new training strategy with model-merging, positioning masked diffusion as a foundational paradigm for any-to-any multimodal AI systems.
Why It Matters
It provides a single, efficient model for cross-modal tasks, paving the way for more coherent and capable multimodal assistants and agents.