SwanVoice masters expressive long-form speech for monologue and dialogue
Zero-shot TTS handles 1-4 speakers with pause-aware alignment and flow-matching DiT...
Traditional zero-shot TTS works well for single-speaker synthesis but struggles with expressive, long-form, multi-speaker dialogue. Common workarounds stitch together separately generated monologue outputs, which breaks acoustic consistency, conversational coherence, and affective continuity across turns. SwanVoice addresses this by modeling monologue and dialogue jointly rather than treating dialogue as a post-hoc assembly step. It is trained on SwanData-Speech, a corpus built from in-the-wild audio using Swan Forced Aligner for pause-aware word-level alignment and RobustMegaTTS3 for hard pronunciation cases. The resulting model generates coherent speech for 1 to 4 speakers without stitching.
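To make "pause-aware word-level alignment" concrete, here is a minimal sketch of how word timestamps could be turned into text with explicit pause symbols. It does not reproduce Swan Forced Aligner: the `Word` record, the `<pause_short>` and `<pause_long>` symbols, and both thresholds are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One aligned word (hypothetical record, times in seconds)."""
    text: str
    start: float
    end: float


def insert_pause_symbols(words, short=0.2, long=0.6):
    """Join words into text, marking silent gaps between neighbours.

    The thresholds and symbol names are illustrative assumptions,
    not values taken from SwanVoice.
    """
    tokens = []
    previous = None
    for word in words:
        if previous is not None:
            gap = word.start - previous.end
            if gap >= long:
                tokens.append("<pause_long>")
            elif gap >= short:
                tokens.append("<pause_short>")
        tokens.append(word.text)
        previous = word
    return " ".join(tokens)


words = [
    Word("well", 0.00, 0.30),
    Word("I", 0.95, 1.05),
    Word("guess", 1.10, 1.40),
    Word("so", 1.65, 1.90),
]
annotated = insert_pause_symbols(words)
print(annotated)
```

In this toy input, the 0.65 s gap after "well" is tagged as a long pause and the 0.25 s gap before "so" as a short one, so the text carries timing cues the model can condition on.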
Architecturally, SwanVoice combines a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching diffusion transformer (DiT) with speaker-turn conditioning. Training proceeds in stages, from monologue speech through mixed data to real dialogue data, and is followed by DiffusionNFT post-training with phone-level and speaker-similarity rewards to refine expressiveness. On SwanBench-Speech, SwanVoice achieves higher richness and hierarchy scores than all evaluated open-source baselines for both monologue and dialogue. Content accuracy remains the main limitation, but the results are a step toward natural conversational speech synthesis.
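As a rough illustration of the two numeric ideas in this paragraph, the sketch below computes how many latent frames a 25 Hz VAE implies for a given duration and integrates a flow-matching ODE from noise to a target with Euler steps. It is a toy under stated assumptions: `LATENT_DIM` is hypothetical, and `toy_velocity` is a closed-form stand-in for the trained DiT, which in the real model would predict velocity from text and speaker-turn conditioning.

```python
import numpy as np

LATENT_RATE_HZ = 25  # latent frame rate stated in the text
LATENT_DIM = 8       # hypothetical latent width, not from the source


def num_latent_frames(duration_s):
    """Number of VAE latent frames for a clip of the given length."""
    return int(round(duration_s * LATENT_RATE_HZ))


def toy_velocity(x, t, target):
    """Stand-in for the DiT: the exact velocity that carries x to a
    single known target along a straight path as t goes from 0 to 1."""
    return (target - x) / (1.0 - t)


def euler_sample(target, steps=16, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t=0) to data (t=1)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt  # t stays below 1, so the division is safe
        x = x + dt * toy_velocity(x, t, target)
    return x


frames = num_latent_frames(2.0)  # 2 seconds of audio
target = np.linspace(-1.0, 1.0, frames * LATENT_DIM).reshape(frames, LATENT_DIM)
sample = euler_sample(target)
print(frames, sample.shape, np.allclose(sample, target))
```

The low frame rate is what keeps long-form generation tractable: one minute of audio corresponds to 1,500 latent frames at 25 Hz, a sequence length a transformer can attend over directly.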
- SwanVoice handles zero-shot synthesis for 1–4 speakers, maintaining acoustic and emotional coherence across dialogue turns.
- Uses a 25 Hz VAE, raw-text conditioning with pause-aware symbols and pinyin substitution, and a flow-matching DiT with speaker-turn conditioning.
- Post-trained with DiffusionNFT using phone-level and speaker-similarity rewards; outperforms open-source baselines on SwanBench-Speech.
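The summary does not say how speaker-turn conditioning is encoded. One plausible scheme, sketched here purely as an assumption, is a per-frame speaker ID track aligned to the 25 Hz latent sequence; the function name and the turn format are hypothetical.

```python
FRAME_RATE_HZ = 25  # latent frame rate stated in the text
MAX_SPEAKERS = 4    # the text reports support for 1 to 4 speakers


def frame_speaker_ids(turns):
    """Expand (speaker_id, duration_s) turns into one ID per latent frame.

    Hypothetical encoding for illustration; the actual SwanVoice
    conditioning may differ.
    """
    ids = []
    for speaker, duration_s in turns:
        if not 1 <= speaker <= MAX_SPEAKERS:
            raise ValueError(f"speaker id {speaker} outside 1..{MAX_SPEAKERS}")
        ids.extend([speaker] * round(duration_s * FRAME_RATE_HZ))
    return ids


# Speaker 1 talks for 1.0 s, speaker 2 for 0.4 s, then speaker 1 for 0.6 s.
ids = frame_speaker_ids([(1, 1.0), (2, 0.4), (1, 0.6)])
print(len(ids), ids[20:30])
```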
Why It Matters
Coherent multi-speaker dialogue synthesis without stitching could improve AI voice assistants and audiobook narration.