WavCube: Unifying Speech Representation for Understanding and Generation via Semantic-Acoustic Joint Modeling
A single latent representation now handles ASR, TTS, and voice conversion, matching or surpassing specialized models.
WavCube, introduced by researchers from multiple institutions, tackles the long-standing challenge of unifying speech understanding and generation in a single model. Traditional approaches rely on separate representations: semantic features from self-supervised learning (SSL) for understanding, and acoustic features from reconstruction for generation. This fragmentation makes truly unified speech systems difficult to build. WavCube instead learns a compact continuous latent from an SSL encoder that works across all three tasks: understanding, reconstruction, and generation.
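The core object is that compact latent: a learned projection that squeezes high-dimensional SSL frame features into a much smaller continuous representation shared by all three tasks. Below is a minimal PyTorch sketch of the idea; the module name (SemanticBottleneck), the dimensions, and the reading of "8x" as a dimensionality reduction (1024 to 128) are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' code): a bottleneck mapping high-dimensional
# SSL frame features to a compact continuous latent shared across tasks.
# SSL_DIM, LATENT_DIM, and the module structure are illustrative assumptions.
import torch
import torch.nn as nn

SSL_DIM = 1024     # e.g. a large SSL encoder's hidden size (assumed)
LATENT_DIM = 128   # 1024 -> 128 mirrors the reported 8x compression (assumed)

class SemanticBottleneck(nn.Module):
    def __init__(self, ssl_dim: int = SSL_DIM, latent_dim: int = LATENT_DIM):
        super().__init__()
        self.down = nn.Linear(ssl_dim, latent_dim)  # compress SSL features
        self.up = nn.Linear(latent_dim, ssl_dim)    # project back for SSL-space losses

    def forward(self, ssl_feats: torch.Tensor):
        # ssl_feats: (batch, frames, ssl_dim), from a frozen SSL encoder
        z = self.down(ssl_feats)   # compact latent used for all three tasks
        recon = self.up(z)         # reconstruction in SSL feature space
        return z, recon

# usage with dummy SSL features
feats = torch.randn(2, 50, SSL_DIM)
z, recon = SemanticBottleneck()(feats)
print(z.shape, recon.shape)  # (2, 50, 128), (2, 50, 1024)
```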
The two-stage training is key. Stage 1 trains a semantic bottleneck to filter noise from raw SSL features, making them tractable for diffusion-based generation. Stage 2 injects fine-grained acoustic details via end-to-end reconstruction, while a semantic anchoring loss keeps the representation grounded in its original semantic manifold.

Results show WavCube matches WavLM on the SUPERB benchmark despite 8x compression, delivers state-of-the-art zero-shot TTS with markedly faster training, and excels at speech enhancement, separation, and voice conversion on SUPERB-SG. This paves the way for more efficient, unified speech AI systems.
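To make the two-stage recipe concrete, here is a hedged sketch of a Stage 2 objective as described: an end-to-end reconstruction term plus a semantic anchoring term that pulls the updated latent back toward its frozen Stage 1 counterpart. The specific loss forms (L1 on waveforms, MSE for anchoring) and the anchor_weight value are assumptions for illustration, not the paper's formulation.

```python
# Hedged sketch of a Stage-2 loss: end-to-end reconstruction plus a
# "semantic anchoring" term keeping the latent near the Stage-1 manifold.
# Loss forms and anchor_weight are assumptions, not the paper's values.
import torch
import torch.nn.functional as F

def stage2_loss(wav_pred: torch.Tensor,
                wav_target: torch.Tensor,
                z_stage2: torch.Tensor,
                z_stage1: torch.Tensor,
                anchor_weight: float = 0.1) -> torch.Tensor:
    # End-to-end reconstruction: decoder output vs. reference waveform.
    recon = F.l1_loss(wav_pred, wav_target)
    # Semantic anchoring: z_stage1 comes from a frozen copy of the Stage-1
    # bottleneck, so gradients only flow into the Stage-2 latent.
    anchor = F.mse_loss(z_stage2, z_stage1.detach())
    return recon + anchor_weight * anchor

# usage with dummy tensors
wav = torch.randn(2, 16000)
z1, z2 = torch.randn(2, 50, 128), torch.randn(2, 50, 128)
loss = stage2_loss(wav + 0.01 * torch.randn_like(wav), wav, z2, z1)
print(loss.item())
```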
- WavCube compresses SSL features 8x while matching WavLM on SUPERB benchmarks for understanding tasks.
- It achieves state-of-the-art zero-shot text-to-speech performance with significantly faster training convergence.
- It excels across speech enhancement, separation, and voice conversion on the SUPERB-SG benchmark.
Why It Matters
A single, compact speech representation could replace multiple specialized models, enabling simpler and more powerful voice AI systems.