Audio & Speech

VoXtream2: Full-stream TTS with dynamic speaking rate control

New AI speech model generates audio 4x faster than real time and can change speaking rate mid-sentence.

Deep Dive

A research team from KTH Royal Institute of Technology has introduced VoXtream2, a model for real-time speech synthesis. It targets a core problem in interactive AI systems: generating natural-sounding speech with minimal delay while keeping control over delivery. Unlike conventional TTS systems, which need the complete text before they start, VoXtream2 operates in "full-stream" mode: it begins speaking after receiving just the first words and continues as more text arrives. The reported first-packet latency is 74 ms, and the model generates audio four times faster than real time on consumer GPUs.
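
To make the speed figure concrete, here is a minimal sketch of the arithmetic behind "four times faster than real time". The function names and the 10-second clip length are illustrative assumptions, not values from the paper; only the 4x speedup comes from the reported results.

```python
def synthesis_time(audio_seconds: float, speedup: float) -> float:
    """Compute time needed to generate `audio_seconds` of speech
    when the model runs `speedup` times faster than real time."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return audio_seconds / speedup


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced.
    Values below 1.0 mean generation outpaces playback."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


speedup = 4.0          # reported: 4x faster than real time
audio_seconds = 10.0   # hypothetical clip length, chosen for illustration

compute_seconds = synthesis_time(audio_seconds, speedup)
rtf = real_time_factor(compute_seconds, audio_seconds)
keeps_up_with_playback = rtf < 1.0

print(f"{audio_seconds:.0f} s of audio takes {compute_seconds} s to generate (RTF {rtf})")
```

A real-time factor below 1.0 is what lets a streaming system stay ahead of playback. First-packet latency is a separate figure: it measures how long the listener waits before any audio starts.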

VoXtream2 adds two features aimed at practical use. The first is dynamic speaking-rate control: the rate can be changed mid-utterance, so an application can speed up or slow down delivery on the fly based on context or user preference. The second is "prompt-text masking", which enables textless audio prompting, so a voice sample can be supplied without a transcript. Despite using a smaller architecture and less training data than comparable models, VoXtream2 achieves competitive results on standard zero-shot benchmarks and on dedicated speaking-rate tests.
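
The control flow of mid-utterance rate changes can be illustrated with a toy loop. This is not the VoXtream2 API or model: `stream_speech`, `get_rate`, and the 300 ms base duration per word are hypothetical stand-ins, and the "synthesizer" only computes a duration for each word. The point is that the rate is read again at every step, so a change takes effect without restarting the stream.

```python
from typing import Callable, Iterable, Iterator, Tuple


def stream_speech(
    words: Iterable[str],
    get_rate: Callable[[], float],
    base_ms_per_word: float = 300.0,  # hypothetical value, for illustration only
) -> Iterator[Tuple[str, float]]:
    """Toy stand-in for a full-stream synthesizer.

    Consumes words one at a time as they arrive and yields
    (word, duration_ms). The speaking rate is re-read for every word.
    """
    for word in words:
        rate = get_rate()
        if rate <= 0:
            raise ValueError("speaking rate must be positive")
        yield word, base_ms_per_word / rate


# Shared state that an application could update while speech is playing.
state = {"rate": 1.0}

spoken = []
incoming_text = iter("the quick brown fox".split())  # text arriving incrementally
for index, (word, duration_ms) in enumerate(
    stream_speech(incoming_text, lambda: state["rate"])
):
    spoken.append((word, duration_ms))
    if index == 1:
        state["rate"] = 1.5  # speed up after the second word

durations = [duration for _, duration in spoken]
print(durations)
```

Because the generator is lazy, the rate change made after the second word applies to every word that follows it.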

The researchers combine distribution matching over duration states with classifier-free guidance across conditioning signals to improve both controllability and synthesis quality. This approach lets the model keep natural prosody while exposing real-time control over delivery. The paper has been submitted to Interspeech 2026, and the work targets more responsive and adaptable voice interfaces for applications ranging from live translation services to interactive gaming and accessibility tools.
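
For readers unfamiliar with classifier-free guidance, the sketch below shows its standard single-signal form: the model is run with and without the conditioning signal, and the output is pushed in the direction of the conditioned prediction. How VoXtream2 combines guidance across several conditioning signals is not described in this summary, so the function name, the scores, and the scale values here are illustrative assumptions only.

```python
from typing import List, Sequence


def cfg_combine(
    uncond: Sequence[float],
    cond: Sequence[float],
    scale: float,
) -> List[float]:
    """Standard classifier-free guidance on a vector of scores (e.g. logits).

    guided = uncond + scale * (cond - uncond)
    scale = 0 ignores the condition, scale = 1 returns the conditioned
    prediction, and scale > 1 amplifies the effect of the condition.
    """
    if len(uncond) != len(cond):
        raise ValueError("uncond and cond must have the same length")
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Hypothetical scores from two forward passes of the same model.
uncond_scores = [0.0, 1.0]  # conditioning signal dropped
cond_scores = [1.0, 1.0]    # conditioning signal present

guided = cfg_combine(uncond_scores, cond_scores, scale=2.0)
print(guided)
```

The guidance scale trades off adherence to the condition against diversity. Larger values follow the condition more closely.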

Key Points
  • Achieves 74 ms first-packet latency and generates audio 4x faster than real time on consumer GPUs
  • Enables dynamic speaking-rate control that can be adjusted mid-utterance without restarting speech
  • Uses textless audio prompting via prompt-text masking, eliminating transcription requirements for voice samples

Why It Matters

Low latency combined with mid-utterance rate control makes voice interfaces for live translation, gaming, and accessibility tools more responsive and easier to adapt to the user.