Research & Papers

Raon-Speech: 9B-parameter SpeechLM posts the strongest overall profile against Qwen2.5-Omni and peers across 42 benchmarks

Open-source 9B speech model understands and generates speech while preserving text capabilities.

Deep Dive

Raon-Speech is a 9B-parameter speech language model (SpeechLM) that jointly handles speech understanding, speech generation, and text tasks in English and Korean. It is built by converting a pre-trained LLM into a SpeechLM, which preserves the base model's text capabilities while adding native speech input and output. Training proceeds in three stages: speech module alignment, end-to-end SpeechLM pre-training with knowledge distillation, and post-training with multi-task preference optimization. The model was trained on 1.38M hours of highly curated speech and text data, yielding a single unified architecture that can understand spoken input, answer questions, and generate speech in both languages.
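
The summary does not say how the knowledge distillation in the second stage is formulated. As a rough illustration only, the sketch below shows one common form of distillation for language models: the KL divergence between a teacher's and a student's next-token distributions. The function names, the temperature parameter, and the choice of KL(teacher || student) are all assumptions made for this example, not details taken from the Raon-Speech paper.

```python
import math
from typing import Sequence


def log_softmax(logits: Sequence[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable log-softmax over one vocabulary distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(
    teacher_logits: Sequence[float],
    student_logits: Sequence[float],
    temperature: float = 1.0,  # hypothetical knob; the source does not mention one
) -> float:
    """KL(teacher || student) for a single token position.

    Generic textbook formulation, assumed for illustration; it is not
    claimed to be the loss used by the Raon-Speech authors.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must share a vocabulary size")
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [2.0, 1.0, 0.1]
    student = [1.5, 1.2, 0.3]
    print(f"distillation loss: {distillation_loss(teacher, student):.4f}")
```

In a setup like this, the teacher would typically be the original text LLM and the student the speech-enabled model, so minimizing the loss pulls the student's text predictions toward the teacher's. That pairing is a plausible reading of "preserves strong text capabilities", not something the summary confirms.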

Across 42 English and Korean benchmarks, Raon-Speech establishes the strongest overall profile among eight similarly sized recent audio foundation models, including Qwen2.5-Omni and FunAudio-Chat, while retaining robust text question answering. An extension, Raon-SpeechChat, adds natural full-duplex conversation. It is trained on 119K hours of time-aligned real and synthetic dialogue data in three steps: causal encoder adaptation, full-duplex pre-training, and fine-tuning for voice and role control. Raon-SpeechChat excels at turn-taking and interruption handling as measured by FDB v1.0. The authors have open-sourced all model checkpoints, the training and inference code, and an interactive demo.

Key Points
  • Raon-Speech is a 9B-parameter English and Korean SpeechLM trained on 1.38M hours of curated speech and text data
  • Strongest overall profile among eight similarly sized audio foundation models, including Qwen2.5-Omni and FunAudio-Chat, across 42 English and Korean benchmarks
  • Raon-SpeechChat enables full-duplex real-time conversation with turn-taking and interruption handling, trained on 119K hours of dialogue data

Why It Matters

With checkpoints, code, and a demo openly released, developers can build natural voice interfaces and real-time conversational agents on a competitive English and Korean speech model.