New TTS technique enables fine-grained speaking style control within utterances
Reports 99-100% gender conversion success and smooth mid-utterance style transitions suited to real-time use.
Researchers Jaehoon Kang, Yejin Lee, Yoonji Park, and Kyuhong Shim from Seoul National University have unveiled techniques to unlock fine-grained speaking style control in prompt-based text-to-speech (TTS) models. Their work addresses two key limitations: the inability to smoothly interpolate styles between different utterances, and the lack of within-utterance style transitions.
For inter-utterance control, the team computes direction vectors between contrastive style prompts in the embedding space and applies simple linear interpolation along them. This enables smooth gender conversion (99-100% success rate), pitch adjustments of up to 36 Hz, and speed changes of up to 1.6 syllables per second.
For intra-utterance transitions, they identified a strong attention bias toward early tokens in autoregressive TTS decoders, which keeps the model anchored to the style it started with. To mitigate this, they introduced KV-cache swapping and sliding-window attention masking, which allow the model to change speaking style mid-sentence without abrupt glitches.
Experiments show that speaker similarity remains high (0.81-0.91) and that perceptual smoothness scores range from 3.48 to 4.48 on a 1-5 scale, which the authors present as evidence of the method's practicality for real-time voice modulation.
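The direction-vector idea can be illustrated with a minimal sketch. It assumes style prompts have already been encoded into fixed-length embeddings; the 4-dimensional vectors, the function names `style_direction` and `interpolate_style`, and the five-step schedule are all hypothetical and are not taken from the paper.

```python
from typing import List

Vector = List[float]


def style_direction(source: Vector, target: Vector) -> Vector:
    """Direction from one style embedding to a contrastive one (target - source)."""
    return [t - s for s, t in zip(source, target)]


def interpolate_style(base: Vector, direction: Vector, alpha: float) -> Vector:
    """Move `base` along `direction`; alpha=0 keeps the base style, alpha=1 reaches the target."""
    return [b + alpha * d for b, d in zip(base, direction)]


# Hypothetical embeddings of two contrastive style prompts (placeholder values).
male = [0.2, -0.1, 0.5, 0.0]
female = [0.6, 0.3, 0.1, 0.0]

direction = style_direction(male, female)

# Five evenly spaced styles from the first prompt to the second.
steps = [interpolate_style(male, direction, i / 4) for i in range(5)]
```

In a real system each interpolated vector would replace the style-prompt embedding fed to the TTS model, so that intermediate values of `alpha` yield intermediate speaking styles.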
- Inter-utterance style interpolation via direction vectors achieves 99-100% gender conversion success and up to 36 Hz pitch variation.
- Intra-utterance style transitions enabled by KV-cache swapping and sliding-window attention masking overcome autoregressive decoder bias.
- Maintains high speaker similarity (0.81-0.91) and perceptual smoothness (3.48-4.48) during dynamic style changes.
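As a rough illustration of sliding-window attention masking, the sketch below builds a causal mask in which each decoding step can attend only to the most recent positions, so early tokens stop dominating later ones. The function name, the window size, and the boolean-matrix representation are assumptions made for illustration; the summary does not specify the paper's exact formulation, and KV-cache swapping is not shown here.

```python
from typing import List


def sliding_window_causal_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k.

    A position may attend to itself and to the (window - 1) positions before it.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        [(q - window < k <= q) for k in range(seq_len)]
        for q in range(seq_len)
    ]


# Hypothetical example: 5 decoding steps, each seeing at most the last 3 positions.
mask = sliding_window_causal_mask(seq_len=5, window=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a window of 3, the final step attends only to positions 2 through 4, so tokens generated under the earlier style no longer influence it directly.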
Why It Matters
Enables natural, dynamic voice modulation for AI assistants, audiobooks, and real-time dubbing applications.