MMAudioSep: Taming Video-to-Audio Generative Model Towards Video/Text-Queried Sound Separation
A new model fine-tunes a video-to-audio generator to isolate specific sounds, outperforming existing methods.
A team from Sony R&D and Waseda University has developed MMAudioSep, an AI model that isolates specific sounds from a video based on a visual or textual query. Instead of building a separation system from the ground up, the researchers took a pre-trained video-to-audio generative model, one that already captures the relationships between visual scenes and their corresponding sounds, and fine-tuned it for the separation task. Leveraging this existing knowledge lets the model learn sound isolation more effectively than training from scratch.
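To make the approach concrete, here is a minimal, purely illustrative PyTorch sketch of query-conditioned separation fine-tuning. None of this is MMAudioSep's actual code: the names (CondAudioGen, finetune_step), tensor shapes, loss, and hyperparameters are hypothetical stand-ins. The sketch only shows the core idea the article describes, in which a pretrained conditional audio generator receives the mixture alongside a video/text query embedding and is fine-tuned to output the queried source.

```python
import torch
import torch.nn as nn

class CondAudioGen(nn.Module):
    """Toy stand-in for a pretrained video-to-audio generator: maps an
    audio-side input plus a (video or text) query embedding to predicted
    audio features. Names and shapes are illustrative assumptions."""
    def __init__(self, d_audio=80, d_cond=32):
        super().__init__()
        self.cond_proj = nn.Linear(d_cond, d_audio)  # inject the query
        self.net = nn.Sequential(
            nn.Linear(d_audio, 256), nn.GELU(), nn.Linear(256, d_audio)
        )

    def forward(self, audio_in, query_emb):
        return self.net(audio_in + self.cond_proj(query_emb))

def finetune_step(model, opt, mixture, query_emb, target):
    """One separation fine-tuning step: the generator now receives the
    mixture as its audio-side input and is trained to reconstruct the
    single source that matches the query."""
    pred = model(mixture, query_emb)
    loss = nn.functional.l1_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Random stand-in tensors: batch of 4, 80-dim audio frames, 32-dim queries.
model = CondAudioGen()  # in the real recipe these weights come pretrained
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: fine-tuning
mix, query, target = torch.randn(4, 80), torch.randn(4, 32), torch.randn(4, 80)
print(finetune_step(model, opt, mix, query, target))
```

The design point mirrored from the article is reuse: the generator's pretrained weights are the starting point, so fine-tuning only has to teach the model to route the mixture toward the queried source rather than learn audio structure from zero.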
In evaluations, MMAudioSep demonstrated superior performance compared to existing sound separation models, including both traditional deterministic methods and newer generative approaches. Crucially, the fine-tuning process did not erase the model's original capability; it retained its ability to generate plausible audio for a silent video, showcasing a flexible, multi-purpose architecture. The research, accepted for presentation at the ICASSP 2026 conference, highlights a promising direction: using powerful, general-purpose audio generation models as foundations for specialized downstream applications like audio cleaning, content editing, and advanced media analysis.
- Built by fine-tuning a pre-trained video-to-audio model, avoiding inefficient training from scratch.
- Outperforms existing baselines at sound separation driven by both video and text queries.
- Retains its original video-to-audio generation ability after fine-tuning, demonstrating the model's flexibility (see the sketch below).
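Continuing the toy sketch above, one plausible way a single conditional model could serve both tasks is to treat generation as separation from a silent (all-zero) mixture. That mechanism is an assumption made here for illustration only, not a detail confirmed by the source.

```python
# Hypothetical dual-mode use of the toy model above (an assumption, not the
# paper's documented mechanism): an all-zero mixture stands in for "no input
# audio", so the same network produces audio features from the query alone.
silent_mix = torch.zeros(4, 80)
generated = model(silent_mix, query)  # generation mode: query only
separated = model(mix, query)         # separation mode: mixture + query
```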
Why It Matters
Enables precise audio editing and enhancement for video content, from isolating dialogue to cleaning up film audio.