UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
One model generates speech, music, and sound effects from plain text.
Researchers from multiple institutions have developed UniSonate, a unified flow-matching framework that generates speech, music, and sound effects from text instructions. Accepted as an oral paper at ACL 2026, UniSonate addresses a longstanding challenge in audio AI: unifying fragmented tasks such as text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA) under a single control paradigm. The model uses a novel dynamic token injection mechanism to project unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). This design reconciles the structural dissonance between the semantic representations of speech and music and the acoustic textures of sound effects.
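To make the dynamic token injection idea concrete, here is a minimal, hypothetical sketch in PyTorch: variable-length sound-effect embeddings are projected and placed onto a fixed-rate temporal grid using explicit start and duration controls, producing a structured conditioning signal a transformer backbone could consume. The class names, shapes, and additive fusion are illustrative assumptions, not the paper's actual MM-DiT integration.

```python
# Hypothetical sketch of "dynamic token injection": unstructured sound-effect
# embeddings are scheduled onto a structured temporal latent grid with explicit
# (start, duration) controls. All names and shapes are illustrative.
import torch
import torch.nn as nn


class DynamicTokenInjector(nn.Module):
    """Projects event embeddings onto a structured temporal latent grid."""

    def __init__(self, event_dim: int, latent_dim: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        self.proj = nn.Linear(event_dim, latent_dim)

    def forward(self, grid: torch.Tensor, events: torch.Tensor,
                starts: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
        # grid:      (B, T, D)  structured temporal latent (e.g. phoneme-aligned)
        # events:    (B, E, De) unstructured sound-effect embeddings
        # starts:    (B, E)     start frame per event
        # durations: (B, E)     length in frames per event
        injected = grid.clone()
        event_latents = self.proj(events)                  # (B, E, D)
        for b in range(grid.size(0)):
            for e in range(events.size(1)):
                s = int(starts[b, e])
                end = min(s + int(durations[b, e]), self.num_frames)
                # Add the event latent over its scheduled span, giving explicit
                # duration control over where the sound effect appears.
                injected[b, s:end] = injected[b, s:end] + event_latents[b, e]
        return injected


if __name__ == "__main__":
    B, T, D, E, De = 2, 128, 256, 3, 64
    injector = DynamicTokenInjector(event_dim=De, latent_dim=D, num_frames=T)
    grid = torch.zeros(B, T, D)                             # empty temporal grid
    events = torch.randn(B, E, De)                          # three sound-effect tokens
    starts = torch.tensor([[0, 40, 90], [10, 50, 100]])
    durations = torch.tensor([[30, 20, 25], [15, 40, 20]])
    print(injector(grid, events, starts, durations).shape)  # torch.Size([2, 128, 256])
```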
In extensive experiments, UniSonate achieves state-of-the-art results: a 1.47% word error rate for instruction-based TTS and 3.18 SongEval Coherence for TTM, while maintaining competitive fidelity in TTA. Crucially, the model demonstrates positive transfer: joint training across diverse audio data significantly improves structural coherence and prosodic expressiveness over single-task baselines. A multi-stage curriculum learning strategy mitigates cross-modal optimization conflicts. This unified approach could simplify audio production workflows and enable new applications in content creation, accessibility, and interactive systems.
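As a rough illustration of what a multi-stage curriculum might look like, the sketch below reweights how often TTS, TTM, and TTA batches are sampled across training stages, starting from the more structured tasks before mixing in sound effects. The stage lengths and sampling weights are invented for illustration and are not the paper's actual recipe.

```python
# Hypothetical multi-stage curriculum: each stage changes how often TTS, TTM,
# and TTA batches are drawn. Stage lengths and weights are illustrative only.
import random

CURRICULUM = [
    # (steps in stage, sampling weights for each task)
    (10_000, {"tts": 0.70, "ttm": 0.30, "tta": 0.00}),   # structured tasks first
    (10_000, {"tts": 0.40, "ttm": 0.30, "tta": 0.30}),   # introduce sound effects
    (20_000, {"tts": 0.34, "ttm": 0.33, "tta": 0.33}),   # fully joint training
]


def task_for_step(step: int, rng: random.Random) -> str:
    """Pick which task to train on at a given global step."""
    for stage_len, weights in CURRICULUM:
        if step < stage_len:
            tasks, probs = zip(*weights.items())
            return rng.choices(tasks, weights=probs, k=1)[0]
        step -= stage_len
    # Past the schedule: keep the final stage's mixture.
    tasks, probs = zip(*CURRICULUM[-1][1].items())
    return rng.choices(tasks, weights=probs, k=1)[0]


if __name__ == "__main__":
    rng = random.Random(0)
    print([task_for_step(s, rng) for s in (0, 5_000, 15_000, 35_000)])
```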
- UniSonate unifies TTS, TTM, and TTA in one flow-matching framework with text instructions
- Dynamic token injection maps unstructured sounds into a structured temporal latent space within the phoneme-driven MM-DiT
- Achieves 1.47% WER for TTS and 3.18 SongEval Coherence for music, with positive transfer effects
Why It Matters
UniSonate could replace multiple specialized audio models, streamlining content creation and enabling richer multimodal AI applications.