Developer Tools

Microsoft VibeVoice: Open-Source Frontier Voice AI

Process hour-long audio in a single pass, with speaker diarization and support for 50+ languages.

Deep Dive

Microsoft has open-sourced VibeVoice, a family of frontier voice AI models covering both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS). VibeVoice-ASR is a unified speech-to-text model that processes up to 60 minutes of continuous audio in a single pass and produces structured transcriptions with speaker identification (Who), timestamps (When), and content (What). It supports over 50 languages and accepts customized hotwords (specific names, technical terms, or background context) to improve recognition accuracy on domain-specific content. The model relies on 7.5 Hz continuous speech tokenizers (acoustic and semantic) that preserve audio fidelity while keeping long sequences short enough to process efficiently.
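To make the Who/When/What structure concrete, here is a minimal Python sketch of how a diarized transcript could be represented and printed. The Segment class, its field names, and the render helper are hypothetical illustrations, not VibeVoice's actual output schema or API.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One transcript entry (hypothetical schema, not VibeVoice's own)."""
    speaker: str    # Who
    start_s: float  # When: segment start, in seconds
    end_s: float    # When: segment end, in seconds
    text: str       # What


def format_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS, truncating fractional seconds."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render(segments: list[Segment]) -> str:
    """Return the transcript as one line per segment, ordered by start time."""
    ordered = sorted(segments, key=lambda seg: seg.start_s)
    return "\n".join(
        f"[{format_timestamp(seg.start_s)} - {format_timestamp(seg.end_s)}] "
        f"{seg.speaker}: {seg.text}"
        for seg in ordered
    )


if __name__ == "__main__":
    demo = [
        Segment("Speaker 2", 65.0, 70.5, "Thanks for having me."),
        Segment("Speaker 1", 0.0, 4.2, "Welcome to the show."),
    ]
    print(render(demo))
```

Keeping speaker, time span, and text in one record is what lets downstream tools filter by speaker or jump to a timestamp without re-running recognition.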

On the TTS side, VibeVoice-TTS can synthesize up to 90 minutes of speech in a single pass, with up to 4 distinct speakers, natural turn-taking, and emotional nuance. A smaller real-time variant, VibeVoice-Realtime-0.5B, accepts streaming text input and offers multilingual voices in nine languages (German, French, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Spanish) plus 11 English style voices. Both TTS models use a next-token diffusion framework that combines a large language model for contextual understanding with a diffusion head for high-fidelity audio generation. The models are available on Hugging Face, along with fine-tuning code and vLLM inference support for faster deployment.
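A quick back-of-the-envelope calculation shows why a 7.5 Hz frame rate makes single-pass processing of such long audio feasible. The sketch below assumes one frame per tokenizer step and counts frames for a single tokenizer stream. The frames_for helper and the 50 Hz comparison rate are illustrative assumptions, not figures from the VibeVoice release.

```python
FRAME_RATE_HZ = 7.5        # tokenizer frame rate cited for VibeVoice
COMPARISON_RATE_HZ = 50.0  # hypothetical higher-rate tokenizer, for contrast only


def frames_for(minutes: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of tokenizer frames needed to cover `minutes` of audio."""
    return round(minutes * 60 * frame_rate_hz)


if __name__ == "__main__":
    for label, minutes in [("ASR, 60 min", 60), ("TTS, 90 min", 90)]:
        low = frames_for(minutes)
        high = frames_for(minutes, COMPARISON_RATE_HZ)
        print(f"{label}: {low:,} frames at 7.5 Hz vs {high:,} at 50 Hz")
    # ASR, 60 min: 27,000 frames at 7.5 Hz vs 180,000 at 50 Hz
    # TTS, 90 min: 40,500 frames at 7.5 Hz vs 270,000 at 50 Hz
```

Under these assumptions, an hour of audio comes to tens of thousands of frames rather than hundreds of thousands, which is a sequence length a large language model can plausibly handle in one context window.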

Key Points
  • VibeVoice-ASR processes 60-minute audio in a single pass with speaker diarization and timestamps
  • VibeVoice-TTS generates up to 90 minutes of multi-speaker speech with up to 4 distinct speakers
  • Both models use a 7.5 Hz continuous speech tokenizer; VibeVoice-ASR supports 50+ languages

Why It Matters

By open-sourcing these models, Microsoft makes long-form speech processing broadly accessible, enabling scalable transcription and synthesis for podcasts, meetings, and accessibility tools.