Sommelier: Scalable Open Multi-turn Audio Pre-processing for Full-duplex Speech Language Models
New open-source tool creates high-quality training data for AI that can talk and listen simultaneously.
A team of researchers including Kyudan Jung, Jihwan Kim, and Jaegul Choo has introduced Sommelier, a scalable open-source pipeline for pre-processing audio data to train next-generation Speech Language Models (SLMs). As AI shifts from text-based LLMs like GPT-4 to conversational speech models, a major bottleneck has been the lack of high-quality training data that captures the messy reality of human conversation, including interruptions, simultaneous talk (overlapping speech), and listener feedback (back-channeling). Existing large datasets are mostly single-speaker or too limited, and standard automatic speech recognition (ASR) and speaker diarization tools often introduce errors and hallucinations when processing such complex audio. Sommelier is designed to clean and prepare this data robustly.
The technical paper (arXiv:2603.25750) runs 34 pages, with 7 figures and 11 tables, and outlines a pipeline that can scale to the vast amounts of multi-turn dialogue needed for training. By providing an open-source tool for generating better training data, Sommelier is meant to support the development of full-duplex AI systems: models that can listen and speak at the same time, in real time. Such systems would allow more natural and fluid human-computer interaction than the rigid turn-taking of current voice assistants. Releasing the pipeline to the community is intended to accelerate progress in the nascent but critical field of conversational AI.
- Open-source pipeline named Sommelier creates training data for full-duplex Speech Language Models (SLMs).
- Addresses the scarcity of training data for multi-speaker conversations with overlapping speech and back-channeling.
- Aims to move AI beyond turn-taking assistants to enable real-time, natural human-computer dialogue.
Why It Matters
By making multi-speaker conversational training data easier to produce, this work could help researchers build AI assistants that hold fluid, natural conversations instead of waiting for their turn to speak.