SelfTTS: cross-speaker style transfer through explicit embedding disentanglement and self-refinement using self-augmentation
New model disentangles speaker identity from emotion, achieving superior naturalness scores without external encoders.
A research team led by Lucas H. Ueda has unveiled SelfTTS, a novel text-to-speech architecture designed to transfer emotional speaking styles between different voices. The core innovation is its ability to perform this 'cross-speaker style transfer' without relying on any external, pre-trained models for speaker or emotion recognition. Instead, SelfTTS internally disentangles the two components—who is speaking and how they are speaking—using a combination of Gradient Reversal Layers (GRL) and a cosine similarity loss function. This forces the model to learn separate, clean representations for speaker identity and emotional content directly from the data.
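The paper's code is not part of this summary, but the general mechanism it describes can be illustrated. The PyTorch sketch below shows the common pattern of combining a gradient reversal layer with a cosine similarity penalty to separate two embeddings; the encoder sizes, classifier heads, and loss weighting are illustrative assumptions, not SelfTTS's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips (and scales) gradients in the backward pass."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd=1.0):
    return GradReverse.apply(x, lambd)

class DisentangledEncoders(nn.Module):
    """Illustrative speaker/style encoders with adversarial heads (not the paper's code).

    The emotion classifier sees the speaker embedding through a GRL (and vice versa),
    so each embedding is pushed to carry no information about the other factor.
    """
    def __init__(self, feat_dim=80, emb_dim=256, n_speakers=10, n_emotions=5):
        super().__init__()
        self.speaker_enc = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                         nn.Linear(emb_dim, emb_dim))
        self.style_enc = nn.Sequential(nn.Linear(feat_dim, emb_dim), nn.ReLU(),
                                       nn.Linear(emb_dim, emb_dim))
        self.emo_cls_on_spk = nn.Linear(emb_dim, n_emotions)   # adversarial head
        self.spk_cls_on_sty = nn.Linear(emb_dim, n_speakers)   # adversarial head

    def forward(self, feats, spk_labels, emo_labels, lambd=1.0):
        spk_emb = self.speaker_enc(feats)        # [B, emb_dim]
        sty_emb = self.style_enc(feats)          # [B, emb_dim]

        # Adversarial losses: the reversed gradient makes each encoder "unlearn" the other factor.
        adv_emo = F.cross_entropy(self.emo_cls_on_spk(grad_reverse(spk_emb, lambd)), emo_labels)
        adv_spk = F.cross_entropy(self.spk_cls_on_sty(grad_reverse(sty_emb, lambd)), spk_labels)

        # Cosine similarity penalty: keep speaker and style embeddings near-orthogonal.
        cos_loss = F.cosine_similarity(spk_emb, sty_emb, dim=-1).abs().mean()

        return spk_emb, sty_emb, adv_emo + adv_spk + cos_loss
```

In this pattern, each classifier tries to predict the "wrong" factor from an embedding, the reversed gradient strips that information out, and the cosine term discourages the two embedding spaces from overlapping.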
To further refine these representations, the team introduced Multi Positive Contrastive Learning (MPCL), which clusters embeddings according to their speaker and emotion labels and sharpens the model's ability to discriminate between them. Perhaps the cleverest element is the self-refinement stage, in which the model uses its own voice conversion capability to generate augmented training data, effectively teaching itself to produce more natural-sounding speech. Submitted to Interspeech 2026, the paper reports that SelfTTS achieves superior emotional naturalness (eMOS) scores and preserves both the target speaker's timbre and the intended emotion more stably than existing state-of-the-art methods.
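The exact MPCL formulation is not given in this summary, but the name suggests a supervised contrastive objective in which every batch sample sharing a label counts as a positive. A minimal sketch of that standard multi-positive pattern, applied to either the speaker or the emotion embeddings with their corresponding labels, might look like the following; the function name and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(embeddings, labels, temperature=0.1):
    """Contrastive loss where every same-label sample in the batch is a positive.

    embeddings: [B, D] speaker or emotion embeddings
    labels:     [B]    the corresponding speaker or emotion labels
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                              # [B, B] scaled cosine similarities
    B = z.size(0)
    eye = torch.eye(B, dtype=torch.bool, device=z.device)

    # Positive mask: same label, excluding the sample itself.
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~eye

    # Log-softmax over all other samples in the batch.
    sim = sim.masked_fill(eye, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)

    # Average log-probability over each anchor's positives (anchors with none contribute 0).
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    loss = -(log_prob * pos_mask.float()).sum(dim=1) / pos_counts
    return loss.mean()
```

Applied once with speaker labels and once with emotion labels, such a loss pulls same-class embeddings together and pushes different-class embeddings apart, which matches the clustering behavior the paper describes.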
- Eliminates dependency on external pre-trained speaker or emotion encoders, creating a more self-contained system.
- Uses Gradient Reversal Layers and Multi Positive Contrastive Learning for explicit disentanglement of speaker and style embeddings.
- Employs a self-augmentation refinement strategy, leveraging its own voice conversion to improve the naturalness of synthesized speech (see the sketch after this list).
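The self-augmentation strategy can also be sketched at a high level. Assuming the model exposes a voice-conversion path, one refinement round could re-synthesize each utterance in every other speaker's voice while keeping its emotion, then add those pairs back to the training set. The Utterance fields and the convert signature below are assumptions for illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import Any, List, Protocol

@dataclass
class Utterance:
    audio: Any        # waveform or acoustic features
    text: str
    speaker: str
    emotion: str

class VoiceConverter(Protocol):
    """Assumed interface: the TTS model exposes a voice-conversion path."""
    def convert(self, audio: Any, speaker: str, style: str) -> Any: ...

def build_self_augmented_set(model: VoiceConverter,
                             data: List[Utterance],
                             speakers: List[str]) -> List[Utterance]:
    """Create cross-speaker training pairs with the model's own voice conversion.

    Each utterance is re-synthesized in every other speaker's voice while keeping the
    original emotional style, exposing the model to the speaker/emotion combinations
    it must handle at inference time.
    """
    augmented = []
    for utt in data:
        for target in speakers:
            if target == utt.speaker:
                continue
            converted = model.convert(utt.audio, speaker=target, style=utt.emotion)
            augmented.append(Utterance(converted, utt.text, target, utt.emotion))
    return augmented
```

Fine-tuning on the union of the original and self-generated data is what lets the model "teach itself" the cross-speaker combinations described above.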
Why It Matters
Simplifies and improves emotional voice cloning for content creation, accessibility tools, and interactive media without complex model pipelines.