Audio & Speech

SwanVoice masters expressive long-form speech for monologue and dialogue

arXiv eess.AS June 01, 2026

⚡Zero-shot TTS handles 1-4 speakers with pause-aware alignment and flow-matching DiT...

Deep Dive

Traditional zero-shot TTS works well for single-speaker synthesis but struggles with expressive, long-form, multi-speaker dialogue. A common workaround stitches together separately generated monologue outputs, which breaks acoustic consistency, conversational coherence, and affective continuity across turns. SwanVoice instead models monologue and dialogue jointly. It is built on SwanData-Speech, a corpus created from in-the-wild audio using the Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for hard pronunciation cases, and it generates coherent speech for 1 to 4 speakers without stitching.
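To make the idea of pause-aware word-level alignment concrete, here is a minimal sketch of how word timestamps could be turned into a transcript with explicit pause symbols. The token names (`<pause_short>`, `<pause_long>`), the thresholds, and the function name are illustrative assumptions; the article does not describe the actual symbol set or how the Swan Forced Aligner represents pauses.

```python
from typing import List, Tuple

# (word, start_seconds, end_seconds) as a forced aligner might emit them.
Alignment = List[Tuple[str, float, float]]

def insert_pause_symbols(words: Alignment,
                         short: float = 0.2,
                         long: float = 0.6) -> str:
    """Insert hypothetical pause tokens wherever the silence between
    two consecutive words exceeds a threshold (values are assumptions)."""
    tokens = []
    prev_end = None
    for word, start, end in words:
        if prev_end is not None:
            gap = start - prev_end
            if gap >= long:
                tokens.append("<pause_long>")
            elif gap >= short:
                tokens.append("<pause_short>")
        tokens.append(word)
        prev_end = end
    return " ".join(tokens)

if __name__ == "__main__":
    aligned = [("hello", 0.0, 0.4), ("there", 0.45, 0.8), ("friend", 1.5, 1.9)]
    print(insert_pause_symbols(aligned))
```

A transcript annotated this way gives the model an explicit signal for rhythm and hesitation rather than leaving pauses implicit in the audio.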

Architecturally, SwanVoice combines a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching diffusion transformer (DiT) with speaker-turn conditioning. Training proceeds in stages, from monologue speech through mixed data to real dialogue data, followed by DiffusionNFT post-training with phone-level and speaker-similarity rewards to refine expressiveness. On SwanBench-Speech, SwanVoice achieves higher richness and hierarchy scores than all evaluated open-source baselines for both monologue and dialogue. Content accuracy remains the main limitation, but the results mark a step toward more natural conversational speech synthesis.
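For readers unfamiliar with flow matching, the sketch below shows the general technique on plain Python lists: building a training pair on a straight noise-to-data path, and sampling by Euler integration of a velocity field. It is not SwanVoice's implementation. The linear path, the function names, and the stand-in `velocity_fn` (which replaces the DiT and ignores text and speaker-turn conditioning) are all assumptions; only the 25 Hz latent rate comes from the article.

```python
import math
import random

LATENT_RATE_HZ = 25  # latent frame rate stated in the article

def num_latent_frames(duration_s: float) -> int:
    """Number of latent frames the VAE would produce for a clip."""
    return math.ceil(duration_s * LATENT_RATE_HZ)

def flow_matching_pair(x1, rng):
    """Build one training example on an assumed straight path:
    x_t = (1 - t) * x0 + t * x1, with regression target x1 - x0."""
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]       # noise sample
    t = rng.random()                              # time in [0, 1)
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]    # constant velocity
    return x_t, t, target_v

def euler_sample(velocity_fn, dim, steps, rng):
    """Integrate dx/dt = velocity_fn(x, t) from noise (t=0) to data (t=1)."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

if __name__ == "__main__":
    rng = random.Random(0)
    print(num_latent_frames(10.0))  # frames for a 10-second clip
    x_t, t, v = flow_matching_pair([0.5, -1.0, 2.0], rng)
    print(len(x_t), len(v))
```

In a real system the network would be trained to predict `target_v` from `x_t`, `t`, and the conditioning inputs, and `velocity_fn` at sampling time would be that trained network.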

Key Points

SwanVoice handles zero-shot synthesis for 1–4 speakers, maintaining acoustic and emotional coherence across dialogue turns.
Uses a 25 Hz VAE, pause-aware symbols + pinyin substitution, and a flow-matching DiT with speaker-turn conditioning.
Post-trained with DiffusionNFT using phone-level and speaker-similarity rewards; outperforms open-source baselines on SwanBench-Speech.

Why It Matters

SwanVoice enables natural, coherent multi-speaker dialogue synthesis without stitching, which could improve AI voice assistants and audiobook narration.
