Researchers bridge speech AI gap, beat ElevenLabs with self-alignment
Synthetic data makes speech AI boring — but two new frameworks fix that.
Spoken Language Models (SLMs) typically rely on synthetic data to scale to low-resource languages, but this introduces a fundamental trade-off that the authors call the Stability-Expressivity Gap. Synthetic data boosts phonetic accuracy, yet it progressively suppresses prosodic variability (the natural variation in pitch, rhythm, and emphasis), leading to a collapse of expressiveness the paper terms Synthetic Erosion. The paper shows that existing commercial systems suffer from the same problem, which makes their synthetic speech sound flat and robotic.
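To make "prosodic variability" concrete, here is a minimal sketch of one way such flattening could be quantified: the spread of pitch (F0) across voiced frames, measured in semitones. This is an illustration only, not the paper's metric; the function name `f0_variability` and the two toy pitch tracks are assumptions made up for this example.

```python
import math
from statistics import mean, pstdev


def f0_variability(f0_hz):
    """Standard deviation of pitch, in semitones, over voiced frames.

    Assumption for this sketch: a value of 0 marks an unvoiced frame.
    """
    voiced = [f for f in f0_hz if f > 0]
    if len(voiced) < 2:
        return 0.0
    # Express each frame relative to the mean pitch so that the result
    # does not depend on whether the speaker's voice is high or low.
    ref = mean(voiced)
    semitones = [12 * math.log2(f / ref) for f in voiced]
    return pstdev(semitones)


# Hypothetical pitch tracks in Hz (toy values, not real measurements).
flat_track = [200, 201, 0, 199, 200, 0, 202]
lively_track = [180, 240, 0, 150, 260, 0, 210]

print(f0_variability(flat_track))
print(f0_variability(lively_track))
```

Under this proxy, a voice that has lost expressiveness would score closer to the flat track than to the lively one.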
To bridge this gap, the authors propose two self-alignment frameworks. Disentanglement-Guided Self-Alignment (DGSA) recovers expressivity by separating prosody and timbre representations, enabling nuanced speech patterns even in complex languages. For extremely low-resource scenarios with few authentic reference recordings, Temperature-Driven Self-Critique (TDSC) uses automated exploration and filtering to stabilize generation. According to the paper, the combined approach outperforms ElevenLabs and Gemini Pro in both objective and subjective evaluations. It also enables the first zero-shot voice cloning capability for Lao, a language with minimal transcribed speech data.
- DGSA recovers prosodic variability by disentangling prosody and timbre in speech representations.
- TDSC uses automated self-critique and temperature-based sampling to stabilize generation with very few real examples.
- The method beats ElevenLabs and Gemini Pro, and achieves first-ever zero-shot voice cloning for Lao.
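The explore-then-filter idea behind TDSC can be sketched roughly as follows: sample candidates at several temperatures, score each with an automated critic, and keep only those above a threshold. This is a toy sketch under stated assumptions, not the authors' implementation. The summary does not describe the critic, the temperature schedule, or the threshold, so `toy_critic`, the temperature list, and the value `0.66` are all hypothetical, and short token lists stand in for generated speech.

```python
import math
import random


def sample_with_temperature(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def explore_and_filter(logits_per_step, temperatures, critic, threshold,
                       n_per_temp=4, seed=0):
    """Sample candidates at each temperature, keep those the critic accepts."""
    rng = random.Random(seed)
    kept = []
    for temperature in temperatures:
        for _ in range(n_per_temp):
            candidate = [sample_with_temperature(step, temperature, rng)
                         for step in logits_per_step]
            score = critic(candidate)
            if score >= threshold:
                kept.append((score, temperature, candidate))
    # Best-scoring candidates first.
    kept.sort(key=lambda item: item[0], reverse=True)
    return kept


def toy_critic(tokens):
    """Hypothetical critic: fraction of tokens matching a fixed target."""
    target = [0, 1, 2]
    return sum(a == b for a, b in zip(tokens, target)) / len(target)


# Hypothetical per-step logits for a three-token "utterance".
toy_logits = [[2.0, 0.5, 0.1], [0.2, 1.8, 0.3], [0.1, 0.4, 2.2]]
kept = explore_and_filter(toy_logits, [0.3, 0.7, 1.2], toy_critic, threshold=0.66)

for score, temperature, candidate in kept:
    print(round(score, 2), temperature, candidate)
```

Higher temperatures produce more varied candidates, and the filter discards the ones that vary in the wrong way. In a real system the critic would judge audio quality rather than compare tokens to a fixed target.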
Why It Matters
These frameworks could bring expressive, natural-sounding speech AI to dozens of underserved languages, democratizing voice technology.