WaveNeXt 2 unifies GAN and diffusion vocoders, trains in 32 hours

arXiv eess.AS May 26, 2026

⚡ New ConvNeXt-based vocoder trains 2.5x faster than FastDiff and beats HiFi-GAN on speed.

Deep Dive

WaveNeXt 2, presented at ICASSP 2026 by researchers from NICT and JAIST, is a unified neural vocoder framework built on ConvNeXt blocks. It addresses a key limitation of prior work: most vocoders are tied to either the GAN or the diffusion paradigm. State-of-the-art models such as Vocos and the original WaveNeXt operate only as GANs and struggle in multi-speaker scenarios. Diffusion vocoders, while faster to train, suffer from slow CPU inference. WaveNeXt 2 introduces residual denoising and sub-modeling, in which multiple sub-models refine the waveform one after another, so that a single architecture can be trained either as a GAN or as a diffusion model.
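To make the idea of sub-models sequentially refining a waveform concrete, here is a minimal toy sketch in plain Python. It is not the authors' implementation: the names `make_toy_sub_model` and `refine` are hypothetical, the "sub-models" are hand-written functions rather than learned ConvNeXt networks, and the conditioning signal is a per-sample stand-in for the acoustic features a real vocoder would receive. It only illustrates the general pattern assumed here, in which each stage predicts a residual correction that is added to the current estimate.

```python
from typing import Callable, List, Sequence

Waveform = List[float]
# A sub-model maps (current estimate, conditioning) to a residual correction.
SubModel = Callable[[Waveform, Sequence[float]], Waveform]


def make_toy_sub_model(step: float) -> SubModel:
    """Hypothetical stand-in for a learned sub-model.

    It returns a residual that moves each sample a fraction `step`
    of the way toward the conditioning value. A real sub-model would
    be a neural network conditioned on acoustic features.
    """
    def sub_model(estimate: Waveform, cond: Sequence[float]) -> Waveform:
        return [step * (c - x) for x, c in zip(estimate, cond)]
    return sub_model


def refine(initial: Waveform,
           cond: Sequence[float],
           sub_models: Sequence[SubModel]) -> Waveform:
    """Apply the sub-models in order, adding each predicted residual."""
    estimate = list(initial)
    for sub_model in sub_models:
        residual = sub_model(estimate, cond)
        estimate = [x + r for x, r in zip(estimate, residual)]
    return estimate


if __name__ == "__main__":
    cond = [1.0, -1.0, 0.5]          # toy conditioning signal (assumption)
    start = [0.0, 0.0, 0.0]          # toy starting estimate (assumption)
    stages = [make_toy_sub_model(0.5) for _ in range(3)]
    print(refine(start, cond, stages))
```

In this toy setup each stage halves the remaining error, so adding stages brings the estimate closer to the conditioning signal. How the real sub-models are initialized and trained under the GAN and diffusion objectives is not described in this summary.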

Experiments on multi-speaker datasets support these claims. The GAN variant (GAN-WaveNeXt 2) is substantially faster than HiFi-GAN and WaveFit. The diffusion variant (Diff-WaveNeXt 2) achieves inference speed superior to FastDiff with just 4 sampling steps, while maintaining competitive synthesis quality. Diff-WaveNeXt 2 also trains in only 32 hours, which makes it practical for teams with limited compute. This efficiency, combined with the ability to train the same architecture in either mode, positions WaveNeXt 2 as a strong contender for real-time speech synthesis in resource-constrained environments.

Key Points

WaveNeXt 2 uses residual denoising and sub-modeling to unify GAN and diffusion vocoder architectures.
GAN-WaveNeXt 2 is much faster than HiFi-GAN and WaveFit; Diff-WaveNeXt 2 beats FastDiff 4-step inference speed.
Diff-WaveNeXt 2 trains in only 32 hours, ideal for resource-constrained applications.

Why It Matters

Faster, unified neural vocoders that train in 32 hours enable high-quality speech synthesis on limited hardware.
