Research & Papers

Improving quality and robustness in LLM-based text-to-speech systems

Amazon researchers combine LoRA, data augmentation, and chain-of-thought reasoning to address major flaws in LLM-based speech synthesis.

Deep Dive

Amazon's AI research team has unveiled a comprehensive solution to three persistent problems plaguing modern LLM-based text-to-speech systems: accent leakage in multilingual voice cloning, robotic expressiveness, and unreliable generation that leads to hallucinations or premature cutoffs. Their approach uses low-rank adaptation (LoRA) to fine-tune polyglot models on locale-specific data, effectively separating speaker identity from accent. This allows a voice cloned from English audio to speak Spanish or German with native pronunciation while preserving the original speaker's vocal characteristics.
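The core of the LoRA idea can be shown in a few lines: freeze the pretrained weights and train only a small low-rank update on locale-specific data. This is a minimal sketch in NumPy with toy dimensions, not Amazon's implementation; all names and sizes here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank = 16, 2  # toy sizes; real TTS transformers are far larger

# Frozen pretrained weight of one projection inside the polyglot model.
W = rng.standard_normal((d_model, d_model))

# LoRA adds a trainable low-rank update B @ A on top of the frozen W.
# Only A and B are tuned on the locale data -- a tiny fraction of W's size.
A = rng.standard_normal((rank, d_model)) * 0.01
B = np.zeros((d_model, rank))  # zero-init, so training starts at the base model

def forward(x, scale=1.0):
    # Effective weight is W + scale * (B @ A); W itself never changes.
    return x @ (W + scale * (B @ A)).T

x = rng.standard_normal(d_model)
# With B at zero, the adapted model reproduces the base model exactly.
assert np.allclose(forward(x), x @ W.T)
```

Because the base weights stay frozen, the speaker identity learned at pre-training is preserved, while the small adapter absorbs the locale-specific pronunciation.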

To tackle expressiveness, the team employs classifier-free guidance (CFG) to generate synthetic reference audio with enhanced emotional cues like laughs and sighs, teaching the model to adopt more natural prosody. For reliability—a critical weakness where autoregressive models can generate confident-sounding nonsense or stop mid-sentence—they implement chain-of-thought reasoning. Before generating speech tokens, the model first predicts phoneme sequences and estimates durations, creating an explicit plan that prevents hallucinations and truncations.
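Classifier-free guidance, as used here to amplify expressive cues, reduces to a simple extrapolation between conditional and unconditional predictions. A minimal sketch over toy next-token logits (the values and vocabulary are made up for illustration):

```python
import numpy as np

def cfg_logits(cond, uncond, w):
    """Classifier-free guidance over next-token logits.

    w = 0 ignores the condition, w = 1 reproduces the conditional model,
    and w > 1 extrapolates past it, strengthening the conditioning signal
    (here, the expressive cues carried by a reference audio clip).
    """
    return uncond + w * (cond - uncond)

# Toy logits over a three-token speech vocabulary.
cond = np.array([1.0, 2.0, 0.5])    # conditional model favors token 1
uncond = np.array([1.5, 1.0, 0.5])  # unconditional model favors token 0

assert np.allclose(cfg_logits(cond, uncond, 1.0), cond)  # w = 1: conditional
assert cfg_logits(cond, uncond, 2.0)[1] > cond[1]        # w > 1: amplified
```

Sampling from such guided distributions yields synthetic reference audio with exaggerated emotional cues, which the team then uses as training targets for more natural prosody.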

The results are measurable: in MUSHRA listening tests across nine locales including US English, German (Germany), and Spanish (Spain), the new system showed quality improvements ranging from 5.5% to over 20% compared to previous models. This represents a significant step toward truly robust, expressive, and accent-free polyglot TTS that can scale a handful of recorded voices to dozens of languages while maintaining naturalness and reliability.
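The chain-of-thought reliability mechanism described above also lends itself to a short sketch: plan first, then generate speech tokens that must cover the plan. Everything below is a simplified stand-in, assuming a hypothetical two-stage pipeline; the real model predicts phonemes and durations autoregressively as text tokens before emitting speech tokens.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    phoneme: str
    n_frames: int  # predicted duration in fixed-rate speech-token frames

def plan(text):
    # Stage 1 (stand-in): a fixed lookup replaces the model's phoneme and
    # duration predictions, keeping the sketch self-contained.
    lexicon = {"hi": [("HH", 3), ("AY", 5)]}
    return [PlanStep(p, n) for p, n in lexicon[text]]

def synthesize(text):
    # Stage 2: generation is forced to cover every planned phoneme for
    # exactly its predicted duration -- it cannot stop early (truncation)
    # or drift away from the plan (hallucination).
    tokens = []
    for step in plan(text):
        tokens.extend(f"{step.phoneme}:{i}" for i in range(step.n_frames))
    return tokens

tokens = synthesize("hi")
assert len(tokens) == 8  # 3 + 5 frames, exactly as planned
```

The explicit plan gives the decoder a known stopping point, which is what rules out the confident-sounding nonsense and mid-sentence cutoffs of unconstrained autoregressive generation.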

Key Points
  • Uses LoRA fine-tuning and locale-specific data augmentation to eliminate accent leakage in polyglot voice cloning, preserving speaker identity across languages.
  • Implements classifier-free guidance (CFG) to boost expressiveness with emotional cues, and chain-of-thought reasoning to prevent hallucinations and premature cutoffs.
  • Achieved 5.5-20% quality improvements in MUSHRA tests across nine locales, making LLM-based TTS more reliable for professional voice cloning applications.

Why It Matters

Enables truly global voice cloning for media, customer service, and accessibility tools without accent artifacts or unreliable outputs.