On The Landscape of Spoken Language Models: A Comprehensive Survey
A new ten-author survey charts the shift from task-specific speech models to universal SLMs, mirroring the text LLM revolution.
A team of ten researchers from institutions including MIT, Carnegie Mellon University, and National Taiwan University has published a comprehensive survey titled 'On The Landscape of Spoken Language Models.' Posted on arXiv (ID: 2504.08528), the paper provides a unifying framework for understanding Spoken Language Models (SLMs), which are emerging as the speech counterpart of large language models (LLMs) such as GPT-4. The authors argue the field is undergoing a paradigm shift: away from training narrow, task-specific models for speech recognition or synthesis, and toward universal SLMs that serve as foundational systems for diverse speech-processing tasks.
The survey systematically categorizes the diverse and rapidly expanding work on SLMs along three core dimensions: model architecture, training methodologies, and evaluation benchmarks. It distinguishes 'pure' speech language models, which learn the distribution of tokenized audio sequences, from hybrid models that combine speech encoders with text LLMs. A key contribution is its effort to standardize the fragmented terminology and evaluation settings across research papers, which currently hinder clear comparison and measurable progress. The paper concludes by outlining significant challenges, such as efficient training on massive multimodal datasets, and critical future directions for the field, aiming to serve as a foundational reference for both new and established researchers.
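To make the architectural split concrete, here is a minimal, hypothetical PyTorch sketch of the two families the survey names: a 'pure' speech LM that does next-token prediction over a discrete audio-token sequence, and a hybrid model that projects speech-encoder outputs into a text LLM's input space. The vocabulary sizes, dimensions, and the toy GRU/Transformer stand-ins are illustrative assumptions, not the survey's implementation.

```python
# A minimal sketch, not taken from the survey, contrasting the two SLM
# architecture families it describes. All module choices, dimensions, and
# the toy encoder/LLM stand-ins here are hypothetical illustrations.
import torch
import torch.nn as nn

AUDIO_VOCAB = 1024   # hypothetical size of a discrete audio-token codebook
TEXT_VOCAB = 32000   # hypothetical text-LLM vocabulary size
D_MODEL = 256


def causal_mask(seq_len: int) -> torch.Tensor:
    """Upper-triangular -inf mask so each position attends only to its past."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


class PureSpeechLM(nn.Module):
    """'Pure' SLM: an autoregressive LM over discrete audio tokens
    (e.g. produced by a neural codec or self-supervised tokenizer)."""

    def __init__(self) -> None:
        super().__init__()
        self.embed = nn.Embedding(AUDIO_VOCAB, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, AUDIO_VOCAB)

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, seq) integer ids -> next-audio-token logits
        x = self.embed(audio_tokens)
        h = self.backbone(x, mask=causal_mask(x.size(1)))
        return self.head(h)


class HybridSLM(nn.Module):
    """Hybrid SLM: a speech encoder projected into a text LLM's input space.
    The GRU and the small Transformer stand in for pretrained components."""

    def __init__(self) -> None:
        super().__init__()
        self.speech_encoder = nn.GRU(80, D_MODEL, batch_first=True)
        self.adapter = nn.Linear(D_MODEL, D_MODEL)  # maps speech features into the "LLM" embedding space
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.text_llm = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, TEXT_VOCAB)

    def forward(self, fbank: torch.Tensor) -> torch.Tensor:
        # fbank: (batch, frames, 80) filterbank features -> text-token logits
        enc, _ = self.speech_encoder(fbank)
        h = self.text_llm(self.adapter(enc))
        return self.head(h)


if __name__ == "__main__":
    tokens = torch.randint(0, AUDIO_VOCAB, (1, 16))
    print(PureSpeechLM()(tokens).shape)               # torch.Size([1, 16, 1024])
    print(HybridSLM()(torch.randn(1, 50, 80)).shape)  # torch.Size([1, 50, 32000])
```

The design difference is the interface: the pure model's entire vocabulary is audio tokens, so speech generation falls out of next-token prediction, while the hybrid model reuses a pretrained text LLM and only learns (or fine-tunes) the encoder-to-LLM adapter.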
- The survey defines and categorizes Spoken Language Models (SLMs), framing them as universal, foundational systems for speech processing, analogous to text LLMs.
- It systematically reviews work by architecture (pure speech LMs vs. speech-text hybrids), training methods, and evaluation, aiming to unify disparate terminology across the field.
- Authored by ten experts, including Hung-Yi Lee and Shinji Watanabe, it identifies key challenges such as scalable multimodal training and sets a roadmap for future SLM research.
Why It Matters
This survey provides a crucial roadmap for developing the next generation of AI that understands and generates speech as fluently as text.