New TTS technique enables fine-grained speaking style control within utterances

arXiv cs.CL May 28, 2026

⚡ Achieves 99-100% gender conversion success and smooth style transitions in real time.

Deep Dive

Jaehoon Kang, Yejin Lee, Yoonji Park, and Kyuhong Shim of Seoul National University present techniques for fine-grained speaking style control in prompt-based text-to-speech (TTS) models. Their work addresses two limitations: such models cannot smoothly interpolate between the styles of different utterances, and they cannot change style within a single utterance.

For inter-utterance control, the team computes direction vectors between contrastive style prompts in the embedding space and applies simple interpolation along them. This enables smooth gender conversion (99-100% success rate), pitch adjustments of up to 36 Hz, and speed changes of up to 1.6 syllables per second.

For intra-utterance transitions, the authors identified a strong attention bias toward early tokens in autoregressive TTS decoders. To mitigate it, they introduced KV-cache swapping and sliding-window attention masking, which let the model change speaking style mid-sentence without abrupt glitches. In experiments, speaker similarity remains high (0.81-0.91) and perceptual smoothness scores range from 3.48 to 4.48 on a 1-5 scale, suggesting the method is practical for real-time voice modulation.
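The direction-vector idea for inter-utterance control can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the function names, the 8-dimensional random vectors standing in for style-prompt embeddings, and the plain linear form of the interpolation. The summary only says "simple interpolation", so the paper's exact formulation (for example, normalization or averaging over several prompt pairs) may differ.

```python
import numpy as np


def style_direction(emb_a: np.ndarray, emb_b: np.ndarray) -> np.ndarray:
    """Direction vector pointing from style A to style B in embedding space."""
    return emb_b - emb_a


def interpolate_style(base: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Shift the base embedding by alpha * direction (alpha=0 keeps the base style)."""
    return base + alpha * direction


# Hypothetical stand-ins for the embeddings of two contrastive style prompts,
# e.g. "low-pitched voice" vs. "high-pitched voice". A real system would get
# these from the TTS model's style/prompt encoder.
rng = np.random.default_rng(0)
emb_low = rng.normal(size=8)
emb_high = rng.normal(size=8)

direction = style_direction(emb_low, emb_high)

# Five evenly spaced points from the first style to the second.
alphas = np.linspace(0.0, 1.0, 5)
steps = [interpolate_style(emb_low, direction, a) for a in alphas]
```

Each element of `steps` would then be fed to the TTS model in place of a prompt embedding, so that intermediate values of `alpha` yield intermediate styles.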

Key Points

Inter-utterance style interpolation via direction vectors achieves 99-100% gender conversion success and up to 36 Hz pitch variation.
Intra-utterance style transitions use KV-cache swapping and sliding-window attention masking to overcome the autoregressive decoder's attention bias toward early tokens.
Maintains high speaker similarity (0.81-0.91) and perceptual smoothness (3.48-4.48) during dynamic style changes.
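The two intra-utterance mechanisms named above can be sketched in NumPy as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the window size is arbitrary, and the reading of "KV-cache swapping" as replacing the cached keys and values of the style-prompt prefix with those of a new prompt of equal length is a guess. How the paper combines the two mechanisms is not described in this summary.

```python
import numpy as np


def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """Boolean mask where query i may attend only to keys j with i - window < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)


def masked_attention_weights(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Row-wise softmax over scores, with disallowed positions forced to zero weight."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)


def swap_prompt_cache(cache_k, cache_v, new_prompt_k, new_prompt_v):
    """Assumed form of KV-cache swapping: overwrite the prompt prefix of the cache.

    The first len(new_prompt_k) cached positions are treated as the style prompt
    and replaced; later positions (already generated speech tokens) are kept.
    """
    assert new_prompt_k.shape == new_prompt_v.shape
    n = new_prompt_k.shape[0]
    k, v = cache_k.copy(), cache_v.copy()
    k[:n] = new_prompt_k
    v[:n] = new_prompt_v
    return k, v


# With a window of 3, the last of 6 positions can no longer see positions 0-2,
# so early tokens cannot dominate its attention.
mask = sliding_window_causal_mask(seq_len=6, window=3)
weights = masked_attention_weights(np.zeros((6, 6)), mask)

# Toy cache: 6 positions, 4-dim heads; the first 2 positions are the style prompt.
cache_k = np.zeros((6, 4))
cache_v = np.zeros((6, 4))
new_k, new_v = swap_prompt_cache(cache_k, cache_v, np.ones((2, 4)), np.ones((2, 4)))
```

In this toy setup the mask removes early positions from each query's view, while the swap changes what the prompt positions contain. Both aim at the same problem: preventing the style set at the start of the utterance from persisting through the decoder's attention.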

Why It Matters

Enables natural, dynamic voice modulation for AI assistants, audiobooks, and real-time dubbing applications.
