WaveNeXt 2 unifies GAN and diffusion vocoders, trains in 32 hours
New ConvNeXt-based vocoder trains 2.5x faster than FastDiff, beats HiFi-GAN speed.
WaveNeXt 2, presented at ICASSP 2026 by researchers from NICT and JAIST, is a unified neural vocoder framework built on ConvNeXt blocks. It tackles a key limitation of prior work: most vocoders are locked into either the GAN or the diffusion paradigm. State-of-the-art models like Vocos and the original WaveNeXt operate only as GANs and struggle with multi-speaker scenarios. Diffusion vocoders, while faster to train, suffer from slow CPU inference. WaveNeXt 2 introduces residual denoising and sub-modeling, in which multiple sub-models refine the waveform one after another, so that a single architecture can be trained either as a GAN or as a diffusion model.
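The sequential refinement idea can be pictured as a simple loop in which each sub-model estimates the noise still present in the waveform and subtracts it. The Python below is a toy sketch under loud assumptions, not the authors' implementation: `make_toy_sub_model`, `refine`, and `mean_abs_error` are hypothetical names, the "sub-model" cheats by looking at the clean target instead of being a trained ConvNeXt network conditioned on acoustic features, and the choice of four stages and a strength of 0.5 is arbitrary.

```python
import math
import random


def make_toy_sub_model(target, strength):
    """Return a hypothetical sub-model that estimates the noise left in a waveform.

    Assumption: a real sub-model would be a trained network conditioned on
    acoustic features. This toy version cheats by reading the clean target.
    """
    def predict_residual(waveform):
        # Estimate a fraction of the remaining error as the residual to remove.
        return [strength * (w - t) for w, t in zip(waveform, target)]
    return predict_residual


def refine(initial, sub_models):
    """Apply sub-models in sequence; each one subtracts its predicted residual."""
    stages = [list(initial)]
    for predict_residual in sub_models:
        current = stages[-1]
        residual = predict_residual(current)
        stages.append([w - r for w, r in zip(current, residual)])
    return stages


def mean_abs_error(a, b):
    """Average absolute difference between two equal-length waveforms."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


random.seed(0)
# Clean reference signal: a short sine wave standing in for speech.
target = [math.sin(2 * math.pi * 5 * n / 200) for n in range(200)]
# Start from Gaussian noise, as a denoising-style vocoder would.
noise_start = [random.gauss(0.0, 1.0) for _ in target]
# Four toy sub-models, each removing half of the remaining error.
sub_models = [make_toy_sub_model(target, strength=0.5) for _ in range(4)]

stages = refine(noise_start, sub_models)
errors = [mean_abs_error(stage, target) for stage in stages]
for index, error in enumerate(errors):
    print(f"after {index} sub-models: mean abs error = {error:.4f}")
```

In this toy setup the error shrinks at every stage because each sub-model only has to clean up what the previous ones left behind, which is the intuition behind splitting the work across several smaller refinement steps.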
Experimental results on multi-speaker datasets confirm the performance gains. The GAN variant (GAN-WaveNeXt 2) is substantially faster than HiFi-GAN and WaveFit. The diffusion variant (Diff-WaveNeXt 2) achieves inference speed superior to FastDiff with just 4 sampling steps, while maintaining competitive synthesis quality. Crucially, Diff-WaveNeXt 2 trains in only 32 hours, making it highly practical for teams with limited compute. This efficiency, combined with its dual-mode flexibility, positions WaveNeXt 2 as a strong contender for real-time speech synthesis in resource-constrained environments.
- WaveNeXt 2 uses residual denoising and sub-modeling to unify GAN and diffusion vocoder architectures.
- GAN-WaveNeXt 2 is substantially faster than HiFi-GAN and WaveFit; Diff-WaveNeXt 2 runs inference faster than FastDiff with 4 sampling steps.
- Diff-WaveNeXt 2 trains in only 32 hours, ideal for resource-constrained applications.
Why It Matters
Faster, unified neural vocoders that train in 32 hours enable high-quality speech synthesis on limited hardware.