FNH-TTS uses Mixture-of-Experts for robust, natural speech synthesis

arXiv eess.AS May 29, 2026

New model captures speaker-dependent speaking rates with an MoE duration predictor

Deep Dive

Current non-autoregressive TTS systems struggle to capture diverse, speaker-dependent duration variations, which often leads to spectral artifacts when their output is fed to HiFi-GAN vocoders. To address this, researchers propose FNH-TTS, a VITS-based end-to-end system that introduces a Mixture-of-Experts Duration Predictor (MoE-DP). The MoE-DP uses multiple expert networks to learn different phoneme duration patterns and speaking-rate characteristics, and dynamically routes each input to a suitable expert. This lets the model handle richer duration variation without sacrificing synthesis stability. According to the reported experiments, MoE-DP is the primary driver of the improved duration modeling, enabling more natural prosody and better speaker fidelity.
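
To make the routing idea concrete, here is a minimal toy sketch of a mixture-of-experts duration predictor. It is not the FNH-TTS architecture: the class name MoEDurationPredictor, the linear experts, the softmax gate, the top-1 routing, and the random weights are all assumptions chosen for illustration. The conversion from log-duration to a frame count (exponentiate, scale, round up) follows the convention common in VITS-style models and is likewise an assumption about this system.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(weights, features):
    return sum(w * f for w, f in zip(weights, features))


class MoEDurationPredictor:
    """Toy MoE duration predictor (hypothetical, for illustration only)."""

    def __init__(self, feat_dim, num_experts, seed=0):
        rng = random.Random(seed)
        # Gate: one scoring vector per expert (untrained random weights).
        self.gate = [[rng.uniform(-1.0, 1.0) for _ in range(feat_dim)]
                     for _ in range(num_experts)]
        # Experts: each is a linear map from features to a log-duration.
        self.experts = [[rng.uniform(-0.5, 0.5) for _ in range(feat_dim)]
                        for _ in range(num_experts)]
        self.bias = [rng.uniform(0.0, 2.0) for _ in range(num_experts)]

    def route(self, features):
        """Return the gate's probability for each expert."""
        return softmax([dot(g, features) for g in self.gate])

    def predict_frames(self, features, length_scale=1.0):
        """Pick the top-1 expert and convert its log-duration to frames."""
        probs = self.route(features)
        expert = max(range(len(probs)), key=probs.__getitem__)
        log_duration = dot(self.experts[expert], features) + self.bias[expert]
        frames = max(1, math.ceil(math.exp(log_duration) * length_scale))
        return frames, expert


# Hypothetical per-phoneme feature vectors (e.g. text plus speaker features).
model = MoEDurationPredictor(feat_dim=4, num_experts=3)
phoneme_features = [[0.2, -0.1, 0.5, 1.0], [1.0, 0.3, -0.7, 0.0]]
for features in phoneme_features:
    frames, expert = model.predict_frames(features)
    print(f"expert={expert} frames={frames}")
```

In a trained system the gate and experts would be learned jointly, so different experts could specialize in different speaking rates or duration patterns; the sketch only shows the data flow.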

On the vocoder side, FNH-TTS replaces standard HiFi-GAN with a VOCOS-style synthesizer that uses Collaborative Multi-Band and Sub-Band Discriminators. These discriminators enforce accurate time-frequency structure across multiple frequency bands, making the synthesis robust to the greater duration variation produced by MoE-DP. Evaluated on LJSpeech, VCTK, and LibriTTS, FNH-TTS outperforms baselines in synthesis quality, duration-category accuracy, vocoder reconstruction quality, and inference efficiency. The work demonstrates that stronger vocoder-side components are essential to unlock the full potential of richer duration variation in neural TTS.
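
As a rough illustration of what "sub-band" processing means, the sketch below splits a magnitude spectrogram into contiguous frequency bands, so that each band could be judged by its own discriminator. It covers only the band-splitting step. The function name split_into_bands and the band edges are hypothetical, and the discriminator networks and the way they collaborate in FNH-TTS are not modeled here.

```python
def split_into_bands(spectrogram, band_edges=(0.0, 0.1, 0.25, 0.5, 0.75, 1.0)):
    """Split a [frames][bins] spectrogram into contiguous frequency bands.

    band_edges are fractions of the frequency axis (assumed values, not
    taken from the paper). Returns one [frames][band_bins] list per band.
    """
    num_bins = len(spectrogram[0])
    bounds = [int(round(edge * num_bins)) for edge in band_edges]
    bands = []
    for low, high in zip(bounds[:-1], bounds[1:]):
        bands.append([frame[low:high] for frame in spectrogram])
    return bands


# Dummy spectrogram: 3 frames, 20 frequency bins each.
spectrogram = [[float(t * 20 + b) for b in range(20)] for t in range(3)]
bands = split_into_bands(spectrogram)
for index, band in enumerate(bands):
    print(f"band {index}: {len(band)} frames x {len(band[0])} bins")
```

Giving narrow low-frequency bands their own discriminator is one common way to keep fine time-frequency detail from being averaged away, which is the kind of structure the paragraph above says these discriminators enforce.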

Key Points

Uses a Mixture-of-Experts Duration Predictor (MoE-DP) to model diverse phoneme durations and speaker-specific speaking rates.
Integrates a VOCOS-style vocoder with Collaborative Multi-Band and Sub-Band Discriminators for stable waveform generation.
Achieves improvements in synthesis quality, duration accuracy, vocoder reconstruction, and inference speed across three public datasets.

Why It Matters

Better duration modeling means more natural, expressive TTS for virtual assistants, audiobooks, and accessibility tools.
