VoXtream2 TTS model enables real-time, controllable speech with 74ms latency

arXiv eess.AS March 17, 2026

New AI speech model generates audio 4x faster than real time and can change speaking rate mid-sentence.

Deep Dive

A research team from KTH Royal Institute of Technology has introduced VoXtream2, a text-to-speech (TTS) model for real-time speech synthesis. The model targets a central problem in interactive AI systems: generating natural-sounding speech with minimal delay while keeping control over delivery. Unlike conventional TTS systems, which need the complete text before synthesis begins, VoXtream2 operates in "full-stream" mode: it starts speaking after receiving only the first words and continues as more text arrives. The reported first-packet latency is 74ms, and the model generates audio four times faster than real time on standard consumer GPUs.
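To make the two headline numbers concrete, here is a minimal Python sketch of how real-time factor and first-packet latency are commonly measured. It is an illustration only: `fake_stream` is a hypothetical stand-in for a streaming synthesizer, not the VoXtream2 API, and the 2.5 s and 10 s timings are made-up values chosen so that the speedup comes out at 4x.

```python
import time
from typing import Iterable, Iterator, Tuple


def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Return synthesis time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def first_packet_latency(chunks: Iterable[bytes]) -> Tuple[float, bytes]:
    """Measure the time until a streaming source yields its first audio chunk."""
    start = time.perf_counter()
    first = next(iter(chunks))
    return time.perf_counter() - start, first


def fake_stream(delay_seconds: float, n_chunks: int) -> Iterator[bytes]:
    """Hypothetical stand-in for a streaming TTS model (assumption, not a real API)."""
    for _ in range(n_chunks):
        time.sleep(delay_seconds)   # pretend the model is computing a chunk
        yield b"\x00" * 320         # dummy audio payload


# Illustrative numbers: 10 s of audio produced in 2.5 s of compute.
rtf = real_time_factor(2.5, 10.0)   # below 1.0 means faster than real time
speedup = 1 / rtf                   # "4x faster than real time"

# Simulate a source whose first chunk arrives after roughly 74 ms.
latency, chunk = first_packet_latency(fake_stream(0.074, 3))
print(f"RTF={rtf}, speedup={speedup}x, first packet after {latency * 1000:.0f} ms")
```

A real-time factor below 1.0 means audio is produced faster than it plays back, which is what keeps a streaming voice interface from stalling once it has started speaking.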

VoXtream2 adds several features aimed at practical use. Its speaking rate can be adjusted mid-utterance, so an application can speed up or slow down delivery on the fly in response to context or user preference. It also supports textless audio prompting through "prompt-text masking", so a voice sample can be supplied without a transcript. Despite a smaller architecture and less training data than comparable models, VoXtream2 achieves competitive results on standard zero-shot benchmarks and on dedicated speaking-rate tests.

The researchers combine distribution matching over duration states with classifier-free guidance across conditioning signals to improve both controllability and synthesis quality. This approach is meant to keep prosody natural while exposing speaking-rate control in real time. The paper has been submitted to Interspeech 2026, and the work points toward more responsive and adaptable voice interfaces for applications such as live translation, interactive gaming, and accessibility tools.
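For readers unfamiliar with classifier-free guidance, the sketch below shows the generic formulation in Python: the model's conditional prediction is pushed away from its unconditional prediction by a guidance scale. This is the textbook form of the technique, not the VoXtream2 implementation. The article does not say how guidance is divided across the model's conditioning signals, and the function name `cfg_combine` and the example values are assumptions made for illustration.

```python
from typing import List, Sequence


def cfg_combine(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Generic classifier-free guidance: uncond + scale * (cond - uncond).

    cond   -- model outputs (e.g. logits) computed with the conditioning signal
    uncond -- model outputs computed with the conditioning signal dropped
    scale  -- guidance strength; 0 ignores the condition, 1 uses it unchanged,
              values above 1 amplify its effect
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


# Illustrative values only.
conditional = [2.0, 0.0]
unconditional = [1.0, 1.0]
guided = cfg_combine(conditional, unconditional, scale=2.0)
print(guided)
```

With a scale above 1, each output moves further in the direction the conditioning signal already pulled it, which is why the technique is often used to strengthen adherence to a control input.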

Key Points

Achieves 74ms first-packet latency and generates audio 4x faster than real time on consumer GPUs
Enables dynamic speaking-rate control that can be adjusted mid-utterance without restarting speech
Uses textless audio prompting via prompt-text masking, eliminating transcription requirements for voice samples

Why It Matters

Low-latency, rate-controllable synthesis makes voice interfaces for live translation, gaming, and accessibility tools more responsive.
