New CFM model rebalances audio with video guidance, beats SOTA
Generative audio remixing outperforms discriminative models by up to 20%
Visually-guided acoustic highlighting aims to rebalance audio tracks so that the sound aligns with the visual focus of a video. Existing discriminative models struggle because there is no one-to-one mapping between poorly balanced and well-balanced audio mixes: a single poorly balanced input can correspond to many acceptable remixes. To address this, the authors (Malard et al.) reframe the task as a generative problem using a Conditional Flow Matching (CFM) framework. They introduce a rollout loss that penalizes trajectory drift over multiple steps, encouraging self-correction and stable long-range flow integration. Additionally, a conditioning module fuses audio and visual features before vector field regression, enabling explicit cross-modal source selection.
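To make the two training signals concrete, here is a minimal NumPy sketch of a flow matching loss and a multi-step rollout loss. It is an illustration under stated assumptions, not the authors' implementation: the summary does not specify the probability path, the step count, or the network, so the straight-line path, the Euler integrator, the function names (`cfm_loss`, `rollout_loss`) and the toy vector fields are all hypothetical choices.

```python
import numpy as np

def linear_path(x0, x1, t):
    """Assumed straight-line path: x_t = (1 - t) * x0 + t * x1.
    Its target velocity is constant along the path: x1 - x0."""
    t = t[:, None]
    return (1.0 - t) * x0 + t * x1, x1 - x0

def cfm_loss(field, x0, x1, t, cond):
    """Single-step flow matching: regress the field onto the target velocity."""
    x_t, u_t = linear_path(x0, x1, t)
    return float(np.mean((field(x_t, t, cond) - u_t) ** 2))

def rollout_loss(field, x0, x1, t_start, cond, n_steps=4):
    """Hypothetical rollout loss: integrate the predicted field with Euler
    steps from t_start to 1 and penalize the distance to the reference
    path after every step, so accumulated drift is penalized."""
    x_t, _ = linear_path(x0, x1, t_start)
    t = t_start.copy()
    dt = (1.0 - t_start) / n_steps
    total = 0.0
    for _ in range(n_steps):
        x_t = x_t + dt[:, None] * field(x_t, t, cond)  # Euler step
        t = t + dt
        x_ref, _ = linear_path(x0, x1, t)
        total += np.mean((x_t - x_ref) ** 2)
    return float(total / n_steps)

# Toy usage: x0 is the starting sample, x1 a well-balanced target (random here).
rng = np.random.default_rng(0)
x0 = rng.normal(size=(8, 16))
x1 = rng.normal(size=(8, 16))
t = rng.uniform(0.0, 0.9, size=8)

# Stand-in "fields": one that is handed the true velocity, one that predicts zero.
oracle_field = lambda x, t, cond: cond
zero_field = lambda x, t, cond: np.zeros_like(x)

print(cfm_loss(oracle_field, x0, x1, t, x1 - x0))
print(rollout_loss(zero_field, x0, x1, t, None))
```

In this sketch the single-step loss only checks the prediction at one point on the path, while the rollout loss feeds the model's own integrated states back in, which is one way to read "penalizes trajectory drift over multiple steps".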
Extensive quantitative and qualitative evaluations show that the CFM method consistently outperforms the previous discriminative state-of-the-art. The authors conclude that generative modeling is better suited to visually-guided audio remixing than discriminative modeling. This work opens new possibilities for creating coherent audio-visual experiences in video editing, film production, and assistive technologies. The paper is available on arXiv (2602.03762).
- Conditional Flow Matching framework reframes audio remixing as a generative task, sidestepping the lack of a one-to-one mapping between poorly balanced and well-balanced mixes
- Rollout loss penalizes trajectory drift over multiple steps, encouraging self-correction and stabilizing long-range flow integration
- Cross-modal conditioning module fuses audio and visual cues before vector field regression for explicit source selection
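As an illustration of the last point, the sketch below fuses audio and visual features with single-head cross-attention before they would be passed to a vector field network. The summary does not say how the conditioning module performs the fusion, so cross-attention, the function name `fuse_audio_visual`, the projection matrices and all dimensions are assumptions made for this example.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse_audio_visual(audio, visual, w_q, w_k, w_v):
    """Hypothetical fusion: each audio frame attends over the video frames.

    audio:  (Ta, Da) audio features
    visual: (Tv, Dv) visual features
    Returns the fused features (Ta, Da + D) and the attention map (Ta, Tv).
    """
    q = audio @ w_q                                   # (Ta, D) queries from audio
    k = visual @ w_k                                  # (Tv, D) keys from video
    v = visual @ w_v                                  # (Tv, D) values from video
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (Ta, Tv)
    attended = attn @ v                               # visual context per audio frame
    return np.concatenate([audio, attended], axis=-1), attn

# Toy usage with random features and projections (dimensions are arbitrary).
rng = np.random.default_rng(0)
audio = rng.normal(size=(10, 32))    # 10 audio frames
visual = rng.normal(size=(5, 64))    # 5 video frames
w_q = rng.normal(size=(32, 16))
w_k = rng.normal(size=(64, 16))
w_v = rng.normal(size=(64, 16))

fused, attn = fuse_audio_visual(audio, visual, w_q, w_k, w_v)
print(fused.shape, attn.shape)
```

The attention map gives each audio frame a weighting over video frames, which is one plausible mechanism for letting visual cues select which sources to emphasize; the fused features would then condition the vector field regression.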
Why It Matters
Enables precise audio-visual alignment in video editing and production, improving viewer experience and accessibility.