CounterFlow generates contradictory video sounds with two-phase sampling
Adding a barking sound to a silent cat video while keeping it synchronized with the on-screen action is now possible.
CounterFlow addresses Counterfactual Video Foley Generation: adding a sound source that contradicts the visual evidence (e.g., a cat video with barking) while keeping the audio temporally synchronized with the video. Existing video-and-text-to-audio (VT2A) models fail when the video and the text prompt conflict; their output stays anchored to the visually implied source. CounterFlow is a two-phase inference-time sampling scheme for pretrained flow-matching models. In Phase 1, the sampler keeps the video conditioning to establish temporal alignment while actively suppressing the sound identity the video implies. Phase 2 then drops the video conditioning entirely, letting the model shape the audio's timbre purely toward the target text prompt. Decoupling timing from timbre in this way enables counterfactual replacements without retraining.
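The two-phase schedule described above can be sketched as a plain Euler sampler. This is only an illustration under stated assumptions, not the authors' implementation: the `velocity` callable, the `switch_frac` and `neg_scale` parameters, and the guidance-style way of pushing away from the source prompt are all hypothetical stand-ins, since the summary does not specify how Phase 1 performs its suppression.

```python
import numpy as np

def two_phase_sample(velocity, x0, video, target_text, source_text,
                     n_steps=50, switch_frac=0.5, neg_scale=1.0):
    """Euler-integrate a flow-matching ODE from t=0 (noise) to t=1 (audio latent).

    `velocity(x, t, video, text)` is a hypothetical wrapper around a pretrained
    model; passing video=None means "no video conditioning".
    """
    x = np.asarray(x0, dtype=float).copy()
    dt = 1.0 / n_steps
    n_phase1 = int(round(switch_frac * n_steps))
    for i in range(n_steps):
        t = i * dt
        if i < n_phase1:
            # Phase 1: keep the video for timing, and (assumption) steer away
            # from the visually implied source with a guidance-style term.
            v_target = velocity(x, t, video, target_text)
            v_source = velocity(x, t, video, source_text)
            v = v_target + neg_scale * (v_target - v_source)
        else:
            # Phase 2: drop the video, follow the target text prompt only.
            v = velocity(x, t, None, target_text)
        x = x + dt * v
    return x

def toy_velocity(x, t, video, text):
    # Stand-in for a real model: pulls x toward a prompt-dependent constant.
    goal = 1.0 if text == "dog barking" else -1.0
    return goal - x

rng = np.random.default_rng(0)
audio_latent = two_phase_sample(toy_velocity, rng.standard_normal(8),
                                video="cat.mp4",  # placeholder, unused by the toy
                                target_text="dog barking",
                                source_text="cat meowing")
print(audio_latent.shape)
```

The switch point matters in a sketch like this: too early and the timing cues from the video may not have settled, too late and the visually implied timbre may already be baked in. Where the real method places that boundary is not stated in this summary.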
The paper also proposes a novel evaluation metric that uses a text-audio co-embedding space to measure both fidelity to the target prompt and residual leakage from the visually implied source. CounterFlow substantially outperforms naive negative prompting and state-of-the-art baselines. Video demos and code are available. The work has been accepted to the CVPR 2026 Workshop on Sight and Sound, highlighting its relevance for AI-powered video editing and creative media production.
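To make the evaluation idea concrete, here is a minimal sketch of how fidelity and leakage could be scored once the generated audio and both prompts have been embedded in a shared text-audio space. The paper's exact formula is not given in this summary, so the use of cosine similarity, the function names, and the `margin` field are assumptions for illustration; the embeddings would come from whatever co-embedding model is used.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def counterfactual_scores(audio_emb, target_text_emb, source_text_emb):
    """Score generated audio against the target and the visually implied source.

    All three inputs are assumed to live in the same co-embedding space.
    """
    fidelity = cosine(audio_emb, target_text_emb)   # higher is better
    leakage = cosine(audio_emb, source_text_emb)    # lower is better
    # `margin` is an illustrative summary number, not a metric from the paper.
    return {"target_fidelity": fidelity,
            "source_leakage": leakage,
            "margin": fidelity - leakage}

# Toy embeddings: the audio is close to the target and far from the source.
scores = counterfactual_scores(audio_emb=[0.9, 0.1],
                               target_text_emb=[1.0, 0.0],
                               source_text_emb=[0.0, 1.0])
print(scores)
```

Reporting the two similarities separately keeps the failure modes distinguishable: a clip can match the target prompt well and still carry audible traces of the original source.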
- Two-phase inference-time sampling: Phase 1 keeps video conditioning for timing while suppressing the video-implied sound source; Phase 2 drops the video and focuses on the target prompt's timbre.
- Evaluated with a custom metric using text-audio co-embedding to measure target fidelity and source leakage.
- Accepted to CVPR 2026 Workshop on Sight and Sound; code and demos publicly available.
Why It Matters
Enables creative audio editing where sound defies visual cues, which is useful for post-production, dubbing, and AI-generated media.