Audio & Speech

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

New system converts voices in real time with low latency, achieving best-in-class streaming WER scores.

Deep Dive

A research team from Shanghai Jiao Tong University and other institutions has introduced X-VC, a novel system for zero-shot streaming voice conversion. The technology allows real-time transformation of a speaker's voice to match an unseen target speaker's characteristics while preserving linguistic content, all with minimal latency. X-VC operates by performing a one-step conversion directly within the latent space of a pre-trained neural audio codec, a significant architectural shift from previous multi-stage approaches.
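The one-step, codec-space data flow described above can be sketched as follows. The encoder/decoder matrices and the additive converter here are toy stand-ins invented for illustration, not the paper's actual codec or converter networks; only the overall shape of the pipeline (encode once, convert in latent space in a single step, decode once) reflects the description:

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, FRAME_DIM = 16, 64

# Toy stand-ins for a pre-trained neural audio codec: the encoder maps
# audio frames to latents, the pseudo-inverse decoder maps them back
# (a low-rank reconstruction, good enough to show the data flow).
W_enc = rng.standard_normal((FRAME_DIM, LATENT_DIM)) / np.sqrt(FRAME_DIM)
W_dec = np.linalg.pinv(W_enc)

def encode(frames):   # (T, FRAME_DIM) -> (T, LATENT_DIM)
    return frames @ W_enc

def decode(latents):  # (T, LATENT_DIM) -> (T, FRAME_DIM)
    return latents @ W_dec

def convert(src_latents, speaker_embedding):
    """One-step conversion directly in codec latent space: a single
    mapping from source latents + target speaker info to converted
    latents, with no intermediate mel / multi-stage path. The additive
    rule below is a placeholder for the learned converter."""
    return src_latents + speaker_embedding

source = rng.standard_normal((100, FRAME_DIM))  # 100 source frames
spk = rng.standard_normal(LATENT_DIM) * 0.1     # target speaker embedding

converted_audio = decode(convert(encode(source), spk))
print(converted_audio.shape)  # (100, 64)
```

The point of the structure is that conversion happens entirely between `encode` and `decode`, so the expensive codec runs exactly once in each direction per stream.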

Key to its performance is a dual-conditioning acoustic converter that models both source audio latents and frame-level acoustic conditions from target reference speech. The system injects utterance-level speaker information through adaptive normalization and employs a training strategy built on generated paired data with role-assignment modes. For streaming applications, X-VC uses a chunkwise inference scheme with overlap smoothing aligned with the codec's segment-based training, enabling real-time processing.
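The utterance-level speaker injection via adaptive normalization can be illustrated with an AdaIN-style layer. The shapes and the convention of splitting the speaker embedding into scale and shift halves are assumptions for illustration, not the paper's exact design:

```python
import numpy as np

def adaptive_norm(features, speaker_embedding, eps=1e-5):
    """AdaIN-style adaptive normalization: normalize each feature
    channel over time, then re-scale and re-shift using parameters
    derived from an utterance-level speaker embedding.

    features:          (T, C) frame features
    speaker_embedding: (2*C,) -- first half -> scale, second -> shift
                       (an illustrative split, assumed here)
    """
    C = features.shape[1]
    mean = features.mean(axis=0, keepdims=True)
    std = features.std(axis=0, keepdims=True)
    normalized = (features - mean) / (std + eps)
    scale, shift = speaker_embedding[:C], speaker_embedding[C:]
    return normalized * (1.0 + scale) + shift

rng = np.random.default_rng(1)
feats = rng.standard_normal((50, 8))   # 50 frames, 8 channels
spk = rng.standard_normal(16) * 0.1    # utterance-level embedding
out = adaptive_norm(feats, spk)
print(out.shape)  # (50, 8)
```

Because the content features are normalized before the speaker-dependent scale and shift are applied, the same mechanism works for speakers never seen in training, which is what makes it a natural fit for zero-shot conversion.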

In benchmark testing on the Seed-TTS-Eval dataset, X-VC achieved the best streaming Word Error Rate (WER) scores in both English and Chinese, while also demonstrating strong speaker similarity in same-language and cross-lingual scenarios. The system also posted a substantially lower offline real-time factor than baseline models, indicating faster processing. These results position codec-space one-step conversion as a practical approach for building high-quality, low-latency voice conversion systems suitable for interactive use cases.

Key Points
  • Achieves best streaming Word Error Rate (WER) on Seed-TTS-Eval benchmark in English and Chinese
  • Uses one-step conversion in neural codec latent space with dual-conditioning acoustic converter
  • Enables real-time processing with chunkwise inference and substantially lower offline real-time factor
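The chunkwise inference with overlap smoothing noted above can be sketched as a linear crossfade between consecutive chunk outputs; the chunk and overlap sizes below are illustrative values, not the paper's, and an identity "converter" stands in for the model so the smoothing itself can be checked:

```python
import numpy as np

def stream_convert(signal, process, chunk=320, overlap=80):
    """Run `process` on overlapping chunks of `signal`, crossfading
    each chunk's first `overlap` samples with the previous chunk's
    tail to avoid audible boundary artifacts."""
    hop = chunk - overlap
    out = np.zeros(len(signal), dtype=float)
    fade_in = np.linspace(0.0, 1.0, overlap)
    start, first = 0, True
    while start < len(signal):
        seg = np.asarray(process(signal[start:start + chunk]), dtype=float)
        end = start + len(seg)
        if first or overlap == 0:
            out[start:end] = seg
            first = False
        else:
            n = min(overlap, len(seg))
            # Linear crossfade over the overlap region.
            out[start:start + n] = (out[start:start + n] * (1 - fade_in[:n])
                                    + seg[:n] * fade_in[:n])
            out[start + n:end] = seg[n:]
        start += hop
    return out

x = np.sin(np.linspace(0, 20 * np.pi, 1600))
y = stream_convert(x, lambda s: s)  # identity converter
print(np.max(np.abs(y - x)) < 1e-9)  # True: smoothing is transparent
```

With the identity converter the output matches the input, confirming the crossfade adds no distortion of its own; in a real system `process` would be the latent-space conversion step, and the overlap hides discontinuities between independently converted chunks.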

Why It Matters

Enables real-time voice cloning for interactive applications like live translation, gaming, and content creation with professional-grade quality.