Audio & Speech

OneVoice unifies voice conversion across speech, expression, and singing

A single MoE model matches specialized VC systems in all three scenarios

Deep Dive

Recent voice conversion (VC) research has advanced both speaker cloning and linguistic preservation, yet the field remains fragmented: separate models are typically needed for linguistic-preserving speech, expressive speech, and singing. The paper 'OneVoice: One Model, Triple Scenarios—Towards Unified Zero-shot Voice Conversion' (arXiv:2601.18094) by Zhichao Wang and five co-authors addresses this with a single unified framework. OneVoice is built on a continuous language model trained with VAE-free next-patch diffusion, a design aimed at high fidelity and efficient sequence modeling. Its core innovation is a Mixture-of-Experts (MoE) architecture with a dual-path routing mechanism: one path isolates a shared expert that holds conversion knowledge common to all scenarios, while the other assigns scenario-specific domain experts guided by global-local cues. In addition, scenario-specific prosodic features are fused into each layer through a gated mechanism, so the model can decide adaptively how much prosody information to use for conditioning.
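
To make the dual-path routing and gated prosody fusion described above more concrete, here is a minimal pure-Python sketch. It is an illustration under assumptions, not the authors' implementation: the class `DualPathMoE`, the function `gated_prosody_fusion`, the use of plain linear maps as experts, soft (rather than top-k) routing, and the concatenation of a token feature with a global scenario cue as router input are all hypothetical choices made for readability.

```python
import math
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class DualPathMoE:
    """Hypothetical layer: one always-on shared expert plus routed domain experts."""

    def __init__(self, dim, num_domain_experts=3, seed=0):
        rng = random.Random(seed)
        self.shared_expert = random_matrix(dim, dim, rng)
        self.domain_experts = [
            random_matrix(dim, dim, rng) for _ in range(num_domain_experts)
        ]
        # Assumption: the router reads the local token feature concatenated
        # with a global scenario cue (both vectors of length `dim`).
        self.router = random_matrix(num_domain_experts, 2 * dim, rng)

    def forward(self, token, global_cue):
        # Path 1: shared conversion knowledge, applied to every token.
        shared_out = matvec(self.shared_expert, token)
        # Path 2: soft assignment over domain experts from local + global cues.
        weights = softmax(matvec(self.router, token + global_cue))
        expert_outs = [matvec(expert, token) for expert in self.domain_experts]
        routed_out = [
            sum(w * out[i] for w, out in zip(weights, expert_outs))
            for i in range(len(token))
        ]
        return [s + r for s, r in zip(shared_out, routed_out)], weights


def gated_prosody_fusion(hidden, prosody, gate_weights):
    """Add prosody to the hidden state, scaled per dimension by a gate in (0, 1)."""
    gates = [sigmoid(g) for g in matvec(gate_weights, hidden + prosody)]
    return [h + g * p for h, g, p in zip(hidden, gates, prosody)]
```

In this sketch, calling `DualPathMoE(dim=4).forward(token, global_cue)` returns the layer output together with the routing weights, which sum to 1 across the domain experts. The gate in `gated_prosody_fusion` lets the layer pass through anywhere from none to all of the prosody vector per dimension.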

To address the imbalance between abundant speech data and scarce singing data, the authors adopt a two-stage progressive training strategy. In the first stage, they run foundational pre-training on large-scale speech corpora. In the second, they improve scenario-specific performance with LoRA-based domain experts, which fine-tune only a small set of parameters per scenario. According to the reported experiments, OneVoice matches or surpasses specialized models in all three scenarios (linguistic-preserving, expressive, and singing). The model also offers flexible control over scenarios and includes a fast decoding variant that generates output in as few as 2 diffusion steps, which reduces inference cost and could help latency-sensitive applications. The demo page provides audio samples for evaluation. The work suggests that a single, well-designed architecture can replace multiple specialized VC systems, hinting at a future where voice conversion becomes as plug-and-play as text-to-speech.
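
The idea of LoRA-based domain experts on top of a frozen pre-trained model can be sketched as follows. This is a generic LoRA illustration in pure Python, not code from the paper: the class `LoRALinear`, its method names, the scenario label `"singing"`, and the `alpha / rank` scaling follow the common LoRA convention and are assumptions here.

```python
import random


def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


class LoRALinear:
    """Frozen base weight plus one small low-rank adapter per scenario."""

    def __init__(self, base_weight, rank, alpha=1.0, seed=0):
        self.base_weight = base_weight  # stage-1 result, kept frozen in stage 2
        self.out_dim = len(base_weight)
        self.in_dim = len(base_weight[0])
        self.rank = rank
        self.scale = alpha / rank
        self.adapters = {}  # scenario name -> (down, up) matrices
        self._rng = random.Random(seed)

    def add_scenario(self, name):
        # `down` starts small and random, `up` starts at zero, so a fresh
        # adapter leaves the base output unchanged until it is trained.
        down = [
            [self._rng.gauss(0.0, 0.01) for _ in range(self.in_dim)]
            for _ in range(self.rank)
        ]
        up = [[0.0] * self.rank for _ in range(self.out_dim)]
        self.adapters[name] = (down, up)

    def trainable_params(self):
        # Parameters updated per scenario: rank * (in_dim + out_dim),
        # versus in_dim * out_dim for the frozen base weight.
        return self.rank * (self.in_dim + self.out_dim)

    def forward(self, x, scenario=None):
        base_out = matvec(self.base_weight, x)
        if scenario is None:
            return base_out
        down, up = self.adapters[scenario]
        delta = matvec(up, matvec(down, x))  # low-rank update: up @ (down @ x)
        return [b + self.scale * d for b, d in zip(base_out, delta)]
```

Because each adapter only holds `rank * (in_dim + out_dim)` values, a data-scarce scenario such as singing can be adapted without disturbing what the base model learned from speech, which is the motivation the paragraph above gives for the two-stage schedule.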

Key Points
  • OneVoice uses a Mixture-of-Experts (MoE) with dual-path routing to handle shared conversion knowledge and scenario-specific expressivity in one model.
  • Scenario-specific prosodic features are fused via a gated mechanism, allowing adaptive conditioning for linguistic-preserving, expressive, and singing conversion.
  • Two-stage progressive training with LoRA-based domain experts overcomes data imbalance, and fast decoding achieves generation in as few as 2 steps.

Why It Matters

One model replaces three specialized voice converters, which could lower development costs and simplify multi-scenario deployment.