Audio & Speech

CounterFlow generates contradictory video sounds with two-phase sampling

Adding a barking sound to a silent cat video while keeping the audio synchronized with the on-screen motion is now possible.

Deep Dive

CounterFlow addresses Counterfactual Video Foley Generation: adding a sound that contradicts the visual evidence (e.g., barking over a cat video) while staying temporally synchronized with the video. Existing video-and-text-to-audio (VT2A) models fail when the video and the text prompt conflict, staying anchored to the visually implied source. CounterFlow is a two-phase inference-time sampling scheme for pretrained flow-matching models. Phase 1 keeps the video conditioning to guide temporal alignment while actively suppressing the sound identity the video implies. Phase 2 then drops the video entirely, letting the model shape the audio's timbre purely toward the target text prompt. Decoupling timing from timbre in this way enables counterfactual replacements without retraining.
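To make the two-phase idea concrete, here is a minimal Python sketch of an Euler sampler that switches conditioning partway through. It is not the authors' implementation: the summary does not give the exact suppression rule, so the sketch assumes a negative-guidance-style term that pushes away from the source prompt. The names `two_phase_sample`, `toy_velocity`, `switch_frac`, and `suppress_scale` are hypothetical, and the toy velocity function stands in for a pretrained flow-matching model.

```python
import numpy as np

def toy_velocity(x, t, video, text):
    """Stand-in for a pretrained model's velocity field (illustrative only).

    Pulls the sample toward the text "embedding"; the video adds a small
    extra term when it is provided.
    """
    pull = text - x
    if video is not None:
        pull = pull + 0.1 * video
    return pull

def two_phase_sample(velocity, x0, video, target_text, source_text,
                     num_steps=20, switch_frac=0.5, suppress_scale=1.0):
    """Euler-integrate from t=0 to t=1, changing conditioning at switch_frac."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    switch_step = int(round(switch_frac * num_steps))
    for i in range(num_steps):
        t = i * dt
        if i < switch_step:
            # Phase 1 (assumed form): keep the video for timing, and push
            # away from the prediction for the visually implied source.
            v_target = velocity(x, t, video=video, text=target_text)
            v_source = velocity(x, t, video=video, text=source_text)
            v = v_target + suppress_scale * (v_target - v_source)
        else:
            # Phase 2: drop the video; follow the target prompt only.
            v = velocity(x, t, video=None, text=target_text)
        x = x + dt * v
    return x

if __name__ == "__main__":
    target = np.ones(4)        # hypothetical "barking" prompt embedding
    source = -np.ones(4)       # hypothetical "meowing" prompt embedding
    video = np.full(4, 0.5)    # hypothetical video feature
    print(two_phase_sample(toy_velocity, np.zeros(4), video, target, source))
```

In this sketch the switch point and the suppression strength are free parameters; how the paper sets them is not stated in this summary.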

The paper also proposes an evaluation metric that uses a text-audio co-embedding space to measure both fidelity to the target prompt and residual leakage from the visually implied source. The authors report that CounterFlow substantially outperforms naive negative prompting and state-of-the-art baselines. Video demos and code are available. The work has been accepted to the CVPR 2026 Workshop on Sight and Sound and is relevant to AI-powered video editing and creative media production.
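A rough sketch of how such a metric could be computed follows. The summary does not give the paper's exact formulation, so this assumes cosine similarity in the shared space and treats the embeddings as given; in practice they would come from a joint text-audio encoder, which is not shown. The function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fidelity_and_leakage(audio_emb, target_text_emb, source_text_emb):
    """Score generated audio against the target and the implied source.

    Higher target_fidelity means the audio matches the requested sound;
    higher source_leakage means the visually implied sound is still audible.
    """
    return {
        "target_fidelity": cosine(audio_emb, target_text_emb),
        "source_leakage": cosine(audio_emb, source_text_emb),
    }

if __name__ == "__main__":
    # Hypothetical 2-D embeddings chosen only to illustrate the two scores.
    print(fidelity_and_leakage([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]))
```

A good counterfactual result under this sketch would score high on the first value and low on the second.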

Key Points
  • Two-phase inference-time sampling: Phase 1 uses the video for timing while suppressing the video-implied sound source; Phase 2 drops the video and steers timbre toward the target prompt.
  • Evaluated with a custom metric using text-audio co-embedding to measure target fidelity and source leakage.
  • Accepted to CVPR 2026 Workshop on Sight and Sound; code and demos publicly available.

Why It Matters

CounterFlow enables creative audio editing in which the sound deliberately defies visual cues, which is useful for post-production, dubbing, and AI-generated media.