Finetuning a Text-to-Audio Model for Room Impulse Response Generation
Using vision-language models to label existing datasets, researchers demonstrate text-to-RIR generation for the first time.
A new research paper from Kirak Kim and Sungyoung Kim, submitted to Interspeech 2026, presents a breakthrough in acoustic simulation. The team has successfully fine-tuned a pre-trained text-to-audio model to generate Room Impulse Responses (RIRs)—the acoustic fingerprints of physical spaces. This marks the first demonstration that large-scale generative audio models can be effectively repurposed for this specialized task, bypassing the traditional, labor-intensive process of recording real-world RIRs.
To address the scarcity of paired text-RIR data, the researchers built an automatic labeling pipeline: vision-language models analyze the images in existing image-RIR datasets and extract descriptive acoustic text labels, producing the paired training data the task requires. The model also incorporates an in-context learning strategy, so at inference it accepts free-form user prompts such as "a large, empty concert hall" or "a small, carpeted bedroom."
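The labeling idea can be sketched as follows: a vision-language model describes the image paired with each RIR, and the extracted room attributes are templated into an acoustic text label. The attribute names and template below are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of the VLM-based labeling step. A real pipeline would
# call a vision-language model on each image; here we assume it has already
# returned a dict of room attributes (names are invented for illustration).

def acoustic_label(attrs: dict) -> str:
    """Turn VLM-extracted room attributes into a text prompt for training."""
    size = attrs.get("size", "medium-sized")
    surface = attrs.get("surface", "untreated")
    room = attrs.get("room_type", "room")
    return f"a {size}, {surface} {room}"

# One (image, RIR) pair thus yields one (text, RIR) training pair:
attrs = {"size": "large", "surface": "reverberant", "room_type": "concert hall"}
print(acoustic_label(attrs))  # -> a large, reverberant concert hall
```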
Evaluations demonstrate the model's practical utility. In MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) listening tests, listeners judged the AI-generated RIRs plausible. And when used to augment speech data for downstream Automatic Speech Recognition (ASR) tasks, the synthetic RIRs improved recognition performance. This work opens new avenues for audio engineers and AI researchers to rapidly prototype acoustic environments for multimedia production, VR/AR, and robust speech system training.
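The augmentation step itself is standard: convolving clean speech with an RIR simulates how that room would reverberate the signal, and the reverberant copies are added to the ASR training set. A minimal sketch, using an exponentially decaying noise burst as a stand-in for a generated RIR:

```python
# Sketch of RIR-based speech augmentation. The synthetic decaying-noise RIR
# below is a placeholder for the model's output, not the paper's generator.
import numpy as np

rng = np.random.default_rng(0)
sr = 16000
speech = rng.standard_normal(sr)           # 1 s of stand-in "speech"
t = np.arange(int(0.3 * sr)) / sr          # 300 ms synthetic RIR
rir = rng.standard_normal(t.size) * np.exp(-t / 0.05)  # ~decaying reflections
rir /= np.max(np.abs(rir))                 # normalize peak amplitude

reverberant = np.convolve(speech, rir)     # reverberant training example
print(reverberant.shape)                   # -> (20799,) i.e. sr + rir.size - 1
```

In practice the convolution would be applied to real utterances (and usually via FFT for speed), but the operation is the same.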
- First-ever method to generate Room Impulse Responses (RIRs) by fine-tuning a text-to-audio model, submitted as arXiv:2603.09708.
- Solved data scarcity by using vision-language models to create text labels from existing image-RIR datasets.
- Model accepts free-form text prompts for RIR generation; outputs were rated plausible in MUSHRA listening tests and proved effective for ASR data augmentation.
Why It Matters
Dramatically reduces cost and time for creating realistic audio simulations, impacting film, gaming, VR, and speech AI development.