Audio & Speech

LASE: Language-Adversarial Speaker Encoding for Indic Cross-Script Identity Preservation

New adversarial encoder eliminates language leakage in multilingual voice cloning, shrinking the cross-script similarity gap to statistically zero.

Deep Dive

A persistent challenge in multilingual voice cloning is that speaker encoders fail to recognize the same voice when the input text is in a different script, a problem known as cross-script identity collapse. Standard models such as WavLM-base-plus-sv and ECAPA-TDNN show cosine-similarity drops of 0.082 and 0.105, respectively, when a Western-accented speaker switches among English, Hindi, Telugu, and Tamil. The gap is smaller for Indian-accented voices (0.006 and 0.044) but remains problematic for cross-script TTS systems that project non-Indic-trained voices into Indic scripts.

To address this, researchers present LASE (Language-Adversarial Speaker Encoder), a lightweight projection head that sits on top of a frozen WavLM-base-plus backbone. LASE is trained with a supervised contrastive loss for speaker identity and a gradient-reversal layer that forces the embedding to be uninformative about language. After training on 1,118 quality-gated cross-script pairs synthesized from eight commercial multilingual voices, LASE reduces the residual gap to effectively zero (0.013 Western, 0.026 Indian, both with 95% confidence intervals including zero). It also boosts the cross-script-vs-floor margin by 2.4–2.7× over baselines. In a multi-speaker diarization task, LASE matches ECAPA-TDNN recall (0.788 vs 0.789) while using roughly 100× less training data. The checkpoint, corpora, and evaluation scripts are publicly released.
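The gradient-reversal mechanism at the heart of LASE can be illustrated without a deep-learning framework: the layer is the identity on the forward pass, but on the backward pass it negates (and scales) the gradient, so the encoder below it is pushed to make the language classifier's job *harder*. A minimal numpy sketch, where the function names and the `lam` value are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def grl_forward(x):
    # Gradient-reversal layer: identity on the forward pass, so the
    # language classifier above it sees the embedding unchanged.
    return x

def grl_backward(grad_output, lam=1.0):
    # Backward pass: negate and scale the incoming gradient by lambda.
    # The encoder below is therefore updated to *increase* the language
    # classifier's loss, stripping language cues from the embedding.
    return -lam * grad_output

# Toy check: a gradient that would improve language prediction
# comes out sign-flipped (and scaled) for the encoder below the GRL.
g = np.array([0.5, -0.25, 1.0])
print(grl_backward(g, lam=2.0))  # sign flipped, scaled by lam=2
```

In a real training loop this layer sits between the projection head and the language classifier; frameworks like PyTorch implement it as a custom autograd function with exactly this forward/backward pair.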

Key Points
  • Existing WavLM-base-plus-sv and ECAPA-TDNN encoders lose 0.082 and 0.105 cosine similarity, respectively, when the same voice switches scripts on a Western-accented corpus.
  • LASE reduces the cross-script similarity gap to near zero (0.013 Western, 0.026 Indian) using gradient-reversal language-adversarial training.
  • LASE amplifies the cross-script margin 2.4–2.7× over baselines and matches ECAPA-TDNN diarization recall with ~100× less training data.
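The speaker-identity side of the objective is a supervised contrastive loss: embeddings of the same speaker (including cross-script renderings) are pulled together, while other speakers are pushed apart. A minimal numpy sketch of the standard supervised-contrastive form; the temperature and batch layout are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over a batch (numpy sketch).

    embeddings: (N, D) speaker embeddings (L2-normalized inside)
    labels:     (N,) speaker ids; the same id across different scripts
                forms a positive pair, everything else is a negative.
    """
    labels = np.asarray(labels)
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)  # exclude self-similarity
    # Row-wise log-softmax over all other samples in the batch.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    loss, count = 0.0, 0
    for i in range(len(labels)):
        pos = (labels == labels[i]) & (np.arange(len(labels)) != i)
        if pos.any():
            loss += -log_prob[i, pos].mean()
            count += 1
    return loss / count
```

With a batch where same-speaker embeddings coincide and different speakers are orthogonal, the loss is near zero; scrambling the labels drives it up, which is the gradient signal that keeps identity stable across scripts.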

Why It Matters

Enables consistent voice cloning across Indic scripts, critical for multilingual TTS and speaker diarization systems.