UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
One model generates speech, music, and sound effects from plain text.
Researchers from multiple institutions have developed UniSonate, a unified flow-matching framework that generates speech, music, and sound effects from text instructions. Accepted as an oral paper at ACL 2026, UniSonate addresses a longstanding challenge in audio AI: unifying fragmented tasks such as text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) under a single control paradigm. The model uses a novel dynamic token injection mechanism to project unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). This design reconciles the structural dissonance between the semantic representations of speech and music and the acoustic textures of sound effects.
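To make the dynamic token injection idea concrete, here is a minimal, hypothetical sketch in PyTorch: variable-length sound-effect embeddings are projected and placed onto a fixed-rate temporal grid using explicit start and duration controls, producing a structured conditioning signal a transformer backbone could consume. The class names, shapes, and additive fusion are illustrative assumptions, not the paper's actual MM-DiT integration.

```python
# Hypothetical sketch of "dynamic token injection": unstructured sound-effect
# embeddings are scheduled onto a structured temporal latent grid with explicit
# (start, duration) controls. All names and shapes are illustrative.
import torch
import torch.nn as nn


class DynamicTokenInjector(nn.Module):
    """Projects event embeddings onto a structured temporal latent grid."""

    def __init__(self, event_dim: int, latent_dim: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        self.proj = nn.Linear(event_dim, latent_dim)

    def forward(self, grid: torch.Tensor, events: torch.Tensor,
                starts: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # grid:      (B, T, D)  structured temporal latent (e.g. phoneme-aligned)
        # events:    (B, E, De) unstructured sound-effect embeddings
        # starts:    (B, E)     start frame per event
        # durations: (B, E)     length in frames per event
        injected = grid.clone()
        event_latents = self.proj(events)                  # (B, E, D)
        for b in range(grid.size(0)):
            for e in range(events.size(1)):
                s = int(starts[b, e])
                end = min(s + int(durations[b, e]), self.num_frames)
                # Add the event latent over its scheduled span, giving explicit
                # duration control over where the sound effect appears.
                injected[b, s:end] = injected[b, s:end] + event_latents[b, e]
        return injected


if __name__ == "__main__":
    B, T, D, E, De = 2, 128, 256, 3, 64
    injector = DynamicTokenInjector(event_dim=De, latent_dim=D, num_frames=T)
    grid = torch.zeros(B, T, D)                             # empty temporal grid
    events = torch.randn(B, E, De)                          # three sound-effect tokens
    starts = torch.tensor([[0, 40, 90], [10, 50, 100]])
    durations = torch.tensor([[30, 20, 25], [15, 40, 20]])
    print(injector(grid, events, starts, durations).shape)  # torch.Size([2, 128, 256])
```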
In extensive experiments, UniSonate achieves state-of-the-art results: a 1.47% word error rate for instruction-based TTS and 3.18 SongEval Coherence for TTM, while maintaining competitive fidelity in TTA. Crucially, the model demonstrates positive transfer: joint training across diverse audio data significantly improves structural coherence and prosodic expressiveness over single-task baselines. A multi-stage curriculum learning strategy mitigates cross-modal optimization conflicts. This unified approach could simplify audio production workflows and enable new applications in content creation, accessibility, and interactive systems.
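As a rough illustration of what a multi-stage curriculum might look like, the sketch below reweights how often TTS, TTM, and TTA batches are sampled across training stages, starting from the more structured tasks before mixing in sound effects. The stage lengths and sampling weights are invented for illustration and are not the paper's actual recipe.

```python
# Hypothetical multi-stage curriculum: each stage changes how often TTS, TTM,
# and TTA batches are drawn. Stage lengths and weights are illustrative only.
import random

CURRICULUM = [
    # (steps in stage, sampling weights for each task)
    (10_000, {"tts": 0.70, "ttm": 0.30, "tta": 0.00}),   # structured tasks first
    (10_000, {"tts": 0.40, "ttm": 0.30, "tta": 0.30}),   # introduce sound effects
    (20_000, {"tts": 0.34, "ttm": 0.33, "tta": 0.33}),   # fully joint training
]


def task_for_step(step: int, rng: random.Random) -> str:
    """Pick which task to train on at a given global step."""
    for stage_len, weights in CURRICULUM:
        if step < stage_len:
            tasks, probs = zip(*weights.items())
            return rng.choices(tasks, weights=probs, k=1)[0]
        step -= stage_len
    # Past the schedule: keep the final stage's mixture.
    tasks, probs = zip(*CURRICULUM[-1][1].items())
    return rng.choices(tasks, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([task_for_step(s, rng) for s in (0, 5_000, 15_000, 35_000)])
```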
- UniSonate unifies TTS, TTM, and TTA in one flow-matching framework with text instructions
- Dynamic token injection maps unstructured sounds into a structured temporal latent space within the phoneme-driven MM-DiT
- Achieves 1.47% WER for TTS and 3.18 SongEval Coherence for music, with positive transfer effects
Why It Matters
UniSonate could replace multiple specialized audio models, streamlining content creation and enabling richer multimodal AI applications.