RiTTA: Modeling Event Relations in Text-to-Audio Generation
New framework teaches AI to distinguish 'car honking after door slams' from 'door slams after car honks'.
A research team from the University of Oxford and other institutions has published RiTTA, a framework designed to address a fundamental problem in Text-to-Audio (TTA) generation. While models like Meta's AudioGen or Stability AI's Stable Audio can produce high-fidelity sounds, they often fail to interpret the relationships between multiple events in a prompt; for instance, they may render "a door slams, then a car honks" in the reverse order. RiTTA tackles this by first establishing a comprehensive benchmark: a new corpus covering common audio events and the relations between them (such as temporal order, causality, and simultaneity), together with evaluation metrics that measure a model's relational understanding.
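To make the benchmark idea concrete, here is a minimal sketch of how a relation triple from such a corpus could be rendered into a text prompt for a TTA model. The `EventRelation` dataclass, the relation names, and the templates are illustrative assumptions, not RiTTA's actual schema:

```python
# Sketch: turning a (head event, relation, tail event) triple into a prompt.
# All names and templates below are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class EventRelation:
    head: str      # first audio event, e.g. "a door slams"
    relation: str  # relation type, e.g. "before", "after", "simultaneous"
    tail: str      # second audio event, e.g. "a car honks"

# Hypothetical templates mapping relation types to natural-language prompts.
TEMPLATES = {
    "before": "{head}, then {tail}",
    "after": "{head} after {tail}",
    "simultaneous": "{head} while {tail}",
}

def to_prompt(rel: EventRelation) -> str:
    """Render a relation triple into a text prompt for a TTA model."""
    return TEMPLATES[rel.relation].format(head=rel.head, tail=rel.tail)

print(to_prompt(EventRelation("a door slams", "before", "a car honks")))
# -> "a door slams, then a car honks"
```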
Beyond diagnosis, RiTTA provides a solution: a fine-tuning framework that can be applied to existing TTA models to enhance their event-relation modeling capabilities. This means developers can retrofit models like AudioLDM or Make-An-Audio to better handle complex, multi-step audio scenes. The work, accepted at EMNLP 2025, includes open-source code and a project site, making it a practical tool for the audio AI community. It shifts the focus from generating isolated sounds to creating coherent auditory narratives in which the sequence and interaction of events matter, a capability crucial for film scoring, game development, and immersive media.
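As a rough illustration of the fine-tuning idea, the toy loop below continues training a text-conditioned generator on (relation prompt embedding, audio) pairs so that ordering cues in the text are reflected in the output. `TinyTTA`, the random tensors, and the reconstruction loss are stand-ins for the sake of a runnable example; RiTTA's actual objective and data pipeline are described in the paper and released code:

```python
# Sketch: relation-aware fine-tuning of a text-conditioned audio generator.
# TinyTTA and the synthetic data are toy stand-ins, not RiTTA's pipeline.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class TinyTTA(nn.Module):
    """Toy stand-in for a text-conditioned audio generator."""
    def __init__(self, text_dim: int = 16, audio_len: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, 128), nn.ReLU(), nn.Linear(128, audio_len)
        )

    def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
        return self.net(text_emb)

# Toy data: pretend embeddings of relation prompts and target waveforms.
text_embs = torch.randn(32, 16)
target_audio = torch.randn(32, 64)
loader = DataLoader(TensorDataset(text_embs, target_audio), batch_size=8)

model = TinyTTA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

for emb, audio in loader:
    # Reconstruction loss used here purely as a placeholder objective.
    loss = loss_fn(model(emb), audio)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```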
- Introduces the first systematic benchmark for evaluating how well TTA models understand relationships (like 'before', 'after', 'caused by') between audio events in a prompt.
- Provides a fine-tuning framework to upgrade existing models (e.g., AudioGen, Stable Audio) to better generate complex, multi-event audio scenes with correct sequencing.
- Includes a new, comprehensive audio event relation corpus and specialized evaluation metrics, filling a major gap in TTA research and development (a toy version of such a metric is sketched below).
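To give a feel for what a relational metric might check, here is a minimal sketch that scores generated audio by testing whether detected event onsets satisfy the stated relation (onsets would come from any off-the-shelf sound event detector). The function, relation semantics, and tolerance are illustrative assumptions, not RiTTA's exact metric definitions:

```python
# Sketch: does the generated audio respect the relation stated in the prompt?
def relation_holds(relation: str, onset_head: float, onset_tail: float,
                   tol: float = 0.25) -> bool:
    """Return True if the detected onsets (in seconds) satisfy the relation.

    tol: tolerance for treating two onsets as simultaneous.
    """
    if relation == "before":
        return onset_head < onset_tail - tol
    if relation == "after":
        return onset_head > onset_tail + tol
    if relation == "simultaneous":
        return abs(onset_head - onset_tail) <= tol
    raise ValueError(f"unknown relation: {relation}")

# Example: the prompt said "a door slams, then a car honks"; a detector
# found the slam at 0.4 s and the honk at 1.9 s.
print(relation_holds("before", 0.4, 1.9))  # True
```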
Why It Matters
Enables AI to generate coherent soundscapes for films, games, and VR, moving beyond isolated sound effects to dynamic auditory stories.