Audio & Speech

Synthetic Data Domain Adaptation for ASR via LLM-based Text and Phonetic Respelling Augmentation

A new AI framework uses LLMs to generate realistic pronunciation variations, making speech recognition more robust.

Deep Dive

A team of researchers has developed a new method to make Automatic Speech Recognition (ASR) systems perform better in specialized domains like medicine or law, where training data is scarce. Their framework tackles the core problem: end-to-end ASR models often fail when faced with domain-specific vocabulary and pronunciations not present in their general training data. The solution is a two-part synthetic data generation pipeline powered by Large Language Models (LLMs).

First, the LLM generates and filters domain-relevant text, ensuring a balance of lexical diversity and term coverage. The key innovation is the second stage: Phonetic Respelling Augmentation (PRA). Instead of applying variability only at the acoustic level (like SpecAugment), PRA uses an LLM to create orthographic pseudo-spellings (e.g., 'fotosynthesis' for 'photosynthesis') that represent realistic pronunciation variations. This phonetic diversity is baked into the text *before* it is converted to synthetic speech, resulting in audio training data that better mimics the real-world messiness of human speech.
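The core move, swapping a domain term for a pseudo-spelling in the text before synthesis, can be sketched in a few lines. The paper prompts an LLM to produce the respellings; the grapheme rules below are a hypothetical rule-based stand-in used purely to illustrate the mechanism, and all function names are assumptions, not the paper's API.

```python
import re

# Hypothetical grapheme substitutions standing in for LLM-generated respellings.
RESPELL_RULES = [
    (r"ph", "f"),    # 'photosynthesis' -> 'fotosynthesis' (the article's example)
    (r"ck", "k"),    # spell as heard
    (r"ae", "ee"),   # 'anaemia' -> 'aneemia'
]

def phonetic_respell(word: str) -> str:
    """Produce one pseudo-spelling that mirrors a plausible pronunciation."""
    respelled = word.lower()
    for pattern, replacement in RESPELL_RULES:
        respelled = re.sub(pattern, replacement, respelled)
    return respelled

def augment_text(sentence: str, domain_terms: set[str]) -> str:
    """Swap domain terms for pseudo-spellings *before* TTS synthesis,
    so the pronunciation variability ends up in the synthetic audio."""
    return " ".join(
        phonetic_respell(token) if token.lower() in domain_terms else token
        for token in sentence.split()
    )

print(augment_text("Plants rely on photosynthesis", {"photosynthesis"}))
# prints: Plants rely on fotosynthesis
```

Because the variation lives in the text fed to the TTS system, the synthesized audio carries the alternate pronunciation while the training transcript keeps the canonical spelling.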

The results are concrete. The team validated the approach on four domain-specific datasets, where the framework delivered consistent reductions in Word Error Rate (WER). This suggests that ASR robustness isn't just a matter of exposure to more domain words, but of modeling how those words are actually spoken, with all their natural variation. The method offers a more efficient path to domain adaptation than costly manual data collection.
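For context on the evaluation metric: WER counts word-level substitutions, deletions, and insertions against a reference transcript, normalized by reference length. A standard Levenshtein-distance implementation (not code from the paper) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution plus one insertion against a 4-word reference -> WER 0.5
print(word_error_rate("the patient has tachycardia",
                      "the patient has taky cardia"))
# prints: 0.5
```

A model that mishears a single domain term can thus incur a large WER penalty on short utterances, which is exactly the failure mode the augmented pronunciations target.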

Key Points
  • Uses LLMs for a two-stage pipeline: domain-text generation & novel Phonetic Respelling Augmentation (PRA).
  • PRA introduces pronunciation variability via pseudo-spellings *before* speech synthesis, unlike post-hoc acoustic methods.
  • Consistently reduced Word Error Rate across four specialized datasets, demonstrating improved ASR robustness.
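The first stage's filtering step, balancing lexical diversity against domain-term coverage, could plausibly be realized as a greedy max-coverage selection. The sketch below is an assumption for illustration; the paper does not specify this algorithm, and the names are hypothetical.

```python
def greedy_select(candidates: list[str], domain_terms: set[str], k: int) -> list[str]:
    """Hypothetical filtering sketch: greedily keep up to k sentences that
    cover the most not-yet-covered domain terms, a simple proxy for
    balancing term coverage with lexical diversity."""
    covered: set[str] = set()
    selected: list[str] = []
    pool = list(candidates)
    for _ in range(min(k, len(pool))):
        def gain(sentence: str) -> int:
            words = set(sentence.lower().split())
            return len((words & domain_terms) - covered)
        best = max(pool, key=gain)
        if gain(best) == 0:   # remaining sentences add no new terms
            break
        selected.append(best)
        pool.remove(best)
        covered |= set(best.lower().split()) & domain_terms
    return selected

sentences = [
    "the stethoscope amplifies heart sounds",
    "tachycardia denotes a fast heart rate",
    "a fast heart rate is common",
]
print(greedy_select(sentences, {"stethoscope", "tachycardia"}, 3))
```

The early stop on zero gain is one way to avoid padding the corpus with sentences that repeat already-covered vocabulary.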

Why It Matters

Enables more accurate voice assistants and transcription tools in specialized fields without expensive, real data collection.