Audio & Speech

Code-switching Speech Recognition Under the Lens: Model- and Data-Centric Perspectives

New AI technique uses LLMs to generate realistic code-switching text, easing a major data scarcity problem.

Deep Dive

A team of researchers from Nanyang Technological University and other institutions has published a comprehensive analysis of code-switching automatic speech recognition (CS-ASR), tackling the unique challenge of transcribing speech that spontaneously mixes multiple languages within a single sentence. The paper, accepted to IEEE TASLP, systematically examines both model-centric approaches, such as language-specific processing and multi-task learning, and data-centric strategies. It finds that the scarcity of high-quality, annotated bilingual or multilingual speech data is a fundamental bottleneck: even languages that are individually high-resource lack sufficient mixed-language examples for effective AI training.

To overcome this data scarcity, the researchers propose a novel prompting strategy called Simplified Equivalence Constraint Theory (SECT). This method guides large language models (LLMs) to generate text that mimics realistic, grammatically valid code-switching patterns. The generated text is then converted into synthetic speech using text-to-speech (TTS) systems, creating artificial but highly useful training data. The SECT-generated data outperformed existing data augmentation methods in both linguistic quality assessments and downstream ASR performance, proving more effective than simply varying speaker accents or textual characteristics in TTS.
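The paper does not publish its exact prompt wording here, but the idea of constraint-guided prompting can be sketched. The following is a minimal, hypothetical illustration (the function name, prompt text, and parameters are assumptions, not the authors' actual implementation): it builds an LLM prompt that instructs the model to switch languages only at points where both languages' word orders agree, which is the essence of the equivalence constraint the SECT strategy simplifies.

```python
# Hypothetical sketch of an SECT-style prompt builder. The prompt wording,
# function name, and parameters are illustrative assumptions; the paper's
# actual prompting strategy may differ in detail.

def build_sect_prompt(sentence: str, matrix_lang: str, embedded_lang: str) -> str:
    """Build an LLM prompt requesting a code-switched rewrite of a sentence.

    The instruction encodes a simplified equivalence constraint: switches
    should occur only at boundaries where the word order of both languages
    agrees, so the output stays grammatical in each language.
    """
    return (
        f"You are a fluent bilingual {matrix_lang}-{embedded_lang} speaker.\n"
        f"Rewrite the following {matrix_lang} sentence as natural "
        f"code-switched text. Switch into {embedded_lang} only at points "
        f"where both languages share the same word order, so the result "
        f"remains grammatical in both.\n"
        f"Sentence: {sentence}\n"
        f"Code-switched version:"
    )


# Example: prompt for an English-Mandarin code-switched rewrite.
prompt = build_sect_prompt("I want to order coffee", "English", "Mandarin")
```

In the full pipeline described above, the LLM's response to such a prompt would then be fed to a TTS system to produce the synthetic speech-text training pairs.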

The study's key insight is that effective CS-ASR requires strategies tailored to the specific linguistic characteristics of the code-switching data. By providing a robust framework for generating diverse and realistic training data, the SECT method paves the way for more accurate voice assistants, transcription services, and communication tools in multilingual regions worldwide, where code-switching is the norm rather than the exception.

Key Points
  • Proposes SECT, a new LLM-prompting method to generate realistic code-switching text for training data.
  • Synthetic speech-text pairs created with SECT and TTS improve CS-ASR performance more than prior augmentation techniques.
  • The research provides a dual model- and data-centric analysis, accepted for publication in the IEEE journal TASLP.

Why It Matters

Enables accurate voice AI for billions of multilingual speakers who naturally mix languages, improving global accessibility.