Raon-Speech: 9B-parameter SpeechLM posts strongest overall profile against Qwen2.5-Omni and peers across 42 benchmarks
Open-source 9B speech model understands and generates speech while preserving text capabilities.
Raon-Speech is a 9B-parameter speech language model (SpeechLM) that jointly handles speech understanding, speech generation, and text tasks in English and Korean. It is built by transforming a pre-trained LLM into a SpeechLM, which preserves strong text capabilities while adding native speech input and output. The training pipeline has three stages: speech module alignment, end-to-end SpeechLM pre-training with knowledge distillation, and post-training with multi-task preference optimization. The model was trained on 1.38M hours of highly curated speech and text data, and the result is a single unified model that can understand, answer, and generate speech in both languages.
On 42 English and Korean benchmarks, Raon-Speech establishes the strongest overall profile among eight similarly sized recent audio foundation models, including Qwen2.5-Omni and FunAudio-Chat, while retaining robust text question answering. An extension, Raon-SpeechChat, enables natural full-duplex conversation. It is trained on 119K hours of time-aligned real and synthetic dialogue data in three steps: causal encoder adaptation, full-duplex pre-training, and fine-tuning for voice and role control. It performs strongly on turn-taking and interruption handling, as measured by FDB v1.0. The authors have open-sourced all model checkpoints, the training and inference code, and an interactive demo.
- Raon-Speech is a 9B-parameter English and Korean SpeechLM trained on 1.38M hours of curated speech and text data
- Strongest overall profile among eight similarly sized audio foundation models, including Qwen2.5-Omni and FunAudio-Chat, across 42 English and Korean benchmarks
- Raon-SpeechChat enables full-duplex real-time conversation with turn-taking and interruption handling, trained on 119K hours of dialogue data
Why It Matters
Open-source state-of-the-art speech AI enables natural voice interfaces and real-time conversational agents for developers.