Research & Papers

Dynin-Omni: Omnimodal Unified Large Diffusion Language Model

A new masked-diffusion model scores 87.6 on GSM8K and reaches a 2.1 word error rate on LibriSpeech speech recognition, outperforming open-source rivals.

Deep Dive

A research team led by Jaeik Kim has unveiled Dynin-Omni, an AI model that unifies text, image, and speech generation and understanding, along with video understanding, within a single, cohesive architecture. Unlike current approaches that either serialize modalities into one token stream (autoregressive) or stitch together separate expert models (compositional), Dynin-Omni formulates everything as a masked diffusion process over a shared token space. This lets the model iteratively refine outputs using bidirectional context from any combination of modalities, enabling true any-to-any modeling.
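To make the iterative-refinement idea concrete, here is a minimal sketch of masked-diffusion sampling over a token sequence. Everything in it is illustrative: `toy_predict` is a stand-in for the paper's denoising network (here it just copies the nearest known token), and the confidence-based unmasking schedule is a generic pattern from masked-diffusion decoders, not Dynin-Omni's actual procedure.

```python
import random

MASK = -1  # sentinel id for a masked position (illustrative; real models use a dedicated mask token)

def toy_predict(tokens):
    """Stand-in for the denoiser: for each masked position, return a
    (token, confidence) guess. A real model would use bidirectional
    attention over the whole sequence; we just copy the nearest
    unmasked neighbor and draw a random confidence."""
    guesses = {}
    for i, t in enumerate(tokens):
        if t == MASK:
            left = next((tokens[j] for j in range(i - 1, -1, -1) if tokens[j] != MASK), None)
            right = next((tokens[j] for j in range(i + 1, len(tokens)) if tokens[j] != MASK), None)
            guess = left if left is not None else right
            guesses[i] = (guess, random.random())
    return guesses

def masked_diffusion_sample(tokens, steps=4):
    """Iterative refinement: each step commits only the most confident
    fraction of masked predictions, then re-predicts the rest with the
    newly revealed context."""
    tokens = list(tokens)
    for step in range(steps):
        guesses = toy_predict(tokens)
        if not guesses:
            break  # nothing left to unmask
        k = max(1, len(guesses) // (steps - step))  # unmask top-k this step
        best = sorted(guesses.items(), key=lambda kv: -kv[1][1])[:k]
        for i, (tok, _) in best:
            tokens[i] = tok
    return tokens

seq = [7, MASK, MASK, 3, MASK, 9]
print(masked_diffusion_sample(seq))  # all masked positions filled after a few steps
```

The key contrast with autoregressive decoding is visible in the loop: every remaining masked position is re-predicted at every step with full left-and-right context, rather than being fixed once in a single left-to-right pass.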

Dynin-Omni was trained using a multi-stage strategy with model-merging for modality expansion. The team rigorously evaluated it across 19 multimodal benchmarks. Key results include a score of 87.6 on the GSM8K math reasoning test, 1733.6 on the MME-P multimodal evaluation, 61.4 on VideoMME, and a word error rate (WER) of just 2.1 on the LibriSpeech test-clean dataset for speech recognition. These scores consistently beat other open-source unified models and remain competitive with specialized, single-modality expert systems.
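The paper does not spell out its model-merging recipe here, but the generic idea behind merging for modality expansion is parameter-space averaging of separately trained checkpoints. A minimal sketch, with plain dicts standing in for real weight tensors and the expert names purely hypothetical:

```python
def merge_checkpoints(models, weights=None):
    """Weighted average of matching parameters across checkpoints.
    `models` is a list of dicts mapping parameter name -> value
    (scalars here for simplicity; real merging averages tensors)."""
    n = len(models)
    weights = weights or [1.0 / n] * n  # default: uniform average
    merged = {}
    for name in models[0]:
        merged[name] = sum(w * m[name] for w, m in zip(weights, models))
    return merged

# Hypothetical modality experts sharing one parameterization:
text_expert = {"layer.w": 1.0, "layer.b": 0.0}
speech_expert = {"layer.w": 3.0, "layer.b": 2.0}
print(merge_checkpoints([text_expert, speech_expert]))
# {'layer.w': 2.0, 'layer.b': 1.0}
```

Averaging only works when the checkpoints share an architecture and a common initialization lineage, which is why merging pairs naturally with a multi-stage training strategy built on one shared backbone.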

The research demonstrates the viability of masked diffusion as a unified paradigm for omnimodal AI. This architecture provides a flexible foundation for building real-time systems that can seamlessly retrieve and generate across modalities, and is a significant step toward creating more capable embodied multimodal agents that can interact with the world through multiple senses.

Key Points
  • First masked-diffusion model unifying text, image, speech, and video in one architecture, using a shared token space for iterative refinement.
  • Outperforms open-source unified models on 19 benchmarks, scoring 87.6 on GSM8K math and achieving 2.1 WER on LibriSpeech speech recognition.
  • Proposes a new training strategy with model-merging, positioning masked diffusion as a foundational paradigm for any-to-any multimodal AI systems.

Why It Matters

It provides a single, efficient model for cross-modal tasks, paving the way for more coherent and capable multimodal assistants and agents.