Audio & Speech

MMAudioReverbs: Video-Guided Acoustic Modeling for Dereverberation and Room Impulse Response Estimation

New model removes reverb and estimates a room's acoustic response from an ordinary video clip and its audio track.

Deep Dive

Researchers introduce MMAudioReverbs, a framework built on the MMAudio video-to-audio model that adds room-acoustic capabilities without changing the network architecture. It performs dereverberation and room impulse response (RIR) estimation using both audio and visual cues. Although fine-tuned on only a small dataset, the model's experiments show that audio and visual cues each hold an advantage depending on which physical room-acoustic property is being estimated.
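For context, dereverberation and RIR estimation are two ways of inverting the same forward model: reverberant audio is (to a first approximation) the dry signal convolved with the room's impulse response. The sketch below illustrates that forward model with a toy synthetic RIR; the signal, sample rate, and decay constant are stand-in values, not anything from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

sr = 16000                      # sample rate in Hz (illustrative choice)
dry = rng.standard_normal(sr)   # 1 second of stand-in "dry" audio

# Toy synthetic RIR: a direct-path impulse followed by
# exponentially decaying noise (a crude model of late reverberation).
t = np.arange(int(0.3 * sr)) / sr
rir = np.exp(-t / 0.1) * rng.standard_normal(t.size)
rir[0] = 1.0

# Forward model: reverberant ("wet") audio = dry audio convolved with the RIR.
wet = np.convolve(dry, rir)

print(wet.shape)  # (20799,) i.e. len(dry) + len(rir) - 1
```

Dereverberation recovers `dry` from `wet`, while RIR estimation recovers `rir`; MMAudioReverbs tackles both, additionally conditioning on the video frames to infer acoustic properties the audio alone underdetermines.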

Key Points
  • Built on top of MMAudio, a state‑of‑the‑art video‑to‑audio model, without any architectural changes.
  • Fine‑tuned on a small dataset to perform dereverberation and room impulse response estimation simultaneously.
  • Achieves top results on CVPR 2026 benchmarks, with audio and visual modalities each excelling at different acoustic properties.

Why It Matters

Enables automatic audio cleanup and room mapping from standard videos, improving conferencing, VR, and film production without specialized equipment.