Microsoft open-sources VibeVoice: 60-min ASR & 90-min TTS

Hacker News April 28, 2026

⚡ Process hour-long audio in a single pass with speaker diarization and 50+ languages.

Deep Dive

Microsoft has open-sourced VibeVoice, a family of frontier voice AI models covering both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). VibeVoice-ASR is a unified speech-to-text model that processes up to 60 minutes of continuous audio in a single pass and produces structured transcriptions with speaker identification (Who), timestamps (When), and content (What). It supports over 50 languages and accepts customized hotwords (specific names, technical terms, or background context) to improve recognition accuracy on domain-specific content. The model uses a 7.5 Hz continuous speech tokenizer (acoustic and semantic) that preserves audio fidelity while keeping long sequences efficient to process.
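To make the Who/When/What structure concrete, here is a minimal Python sketch of how a downstream tool might hold such a transcript. The `Segment` class, its field names, and the helper functions are assumptions made for illustration; the article does not describe VibeVoice-ASR's actual output schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One transcript entry. Field names are illustrative, not VibeVoice's schema."""
    speaker: str    # Who
    start_s: float  # When: segment start, in seconds
    end_s: float    # When: segment end, in seconds
    text: str       # What


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm, which is enough for a 60-minute recording."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Sum the talk time of each speaker across all segments."""
    totals: dict[str, float] = {}
    for seg in segments:
        totals[seg.speaker] = totals.get(seg.speaker, 0.0) + (seg.end_s - seg.start_s)
    return totals


# Hypothetical transcript data, made up for this example.
example = [
    Segment("Speaker 1", 0.0, 4.0, "Welcome back to the show."),
    Segment("Speaker 2", 4.0, 10.5, "Thanks, glad to be here."),
    Segment("Speaker 1", 10.5, 12.0, "Let's get started."),
]

for seg in example:
    print(f"[{format_timestamp(seg.start_s)}] {seg.speaker}: {seg.text}")
```

Keeping speaker, time span, and text in one record makes per-speaker summaries such as `speaking_time(example)` a simple loop over the segments.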

On the TTS side, VibeVoice-TTS can synthesize up to 90 minutes of speech in a single pass, with up to 4 distinct speakers, natural turn-taking, and emotional nuance. A smaller real-time variant, VibeVoice-Realtime-0.5B, supports streaming text input and multilingual voices in nine languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish) plus 11 English style voices. Both models use a next-token diffusion framework that combines a large language model for contextual understanding with a diffusion head for high-fidelity audio generation. The models are available on Hugging Face, along with finetuning code and vLLM inference support for faster deployment.
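A rough calculation shows why the 7.5 Hz frame rate matters for these durations. The sketch below counts tokenizer frames for 60 and 90 minutes of audio. The 50 Hz comparison rate is an assumption chosen purely for contrast, not a figure from the article, and the count ignores any text or control tokens a real model would add.

```python
FRAME_RATE_HZ = 7.5  # tokenizer frame rate stated in the article


def frames_for(minutes: float, rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * rate_hz)


asr_frames = frames_for(60)  # 60 min * 60 s * 7.5 Hz = 27000 frames
tts_frames = frames_for(90)  # 90 min * 60 s * 7.5 Hz = 40500 frames

# Hypothetical higher-rate tokenizer, used only to show the scale difference.
comparison_frames = frames_for(60, rate_hz=50.0)  # 180000 frames

print(asr_frames, tts_frames, comparison_frames)
```

At 7.5 Hz, an hour of audio comes to 27,000 frames, which is the kind of sequence length that can plausibly fit in a single pass; the assumed 50 Hz rate would need about 6.7 times as many.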

Key Points

VibeVoice-ASR processes 60-minute audio in a single pass with speaker diarization and timestamps
VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers
Both models use a 7.5 Hz continuous speech tokenizer and support 50+ languages

Why It Matters

Open-sourcing these models makes long-form speech processing broadly accessible, enabling scalable transcription and synthesis for podcasts, meetings, and accessibility tools.
