Audio & Speech

AC-Foley: Reference-Audio-Guided Video-to-Audio Synthesis with Acoustic Transfer

New model bypasses text prompts, using audio clips to generate precise Foley sounds for video.

Deep Dive

A research team led by Pengjun Fang has introduced AC-Foley, a novel AI model for video-to-audio (V2A) synthesis that changes how generation is controlled. Where existing methods pair the video with a text prompt, AC-Foley uses a reference audio clip to guide the generation of sound effects (Foley). This targets two major bottlenecks in the field: the semantic granularity gap in training data, where acoustically distinct sounds are often lumped under coarse labels (e.g., 'footsteps'), and the inherent ambiguity of text when describing subtle acoustic features. Conditioning on audio sidesteps both limitations and enables precise manipulation of acoustic attributes.
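To make the granularity gap concrete, here is a minimal sketch that is not taken from AC-Foley itself. It synthesizes two toy "footsteps" clips (a low thud and a high click) and compares their spectral centroids, a common rough proxy for brightness. The sample rate, clip length, frequencies, and function names are all illustrative assumptions.

```python
import cmath
import math

SAMPLE_RATE = 8000  # Hz; arbitrary value chosen for this illustration
NUM_SAMPLES = 256   # short clip so the naive DFT below stays fast


def decaying_tone(freq_hz, decay=30.0):
    """Toy stand-in for a recorded sound: an exponentially decaying sine."""
    return [
        math.exp(-decay * n / SAMPLE_RATE)
        * math.sin(2 * math.pi * freq_hz * n / SAMPLE_RATE)
        for n in range(NUM_SAMPLES)
    ]


def spectral_centroid(samples):
    """Magnitude-weighted mean frequency (Hz), computed with a naive DFT."""
    n = len(samples)
    weighted_sum = 0.0
    magnitude_sum = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(
            s * cmath.exp(-2j * math.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        magnitude = abs(coeff)
        weighted_sum += (k * SAMPLE_RATE / n) * magnitude
        magnitude_sum += magnitude
    return weighted_sum / magnitude_sum


# Two hypothetical training clips that share one coarse text label.
clips = [
    ("footsteps", decaying_tone(125.0)),   # dull thud, e.g. boots on carpet
    ("footsteps", decaying_tone(2000.0)),  # sharp click, e.g. heels on tile
]

labels = [label for label, _ in clips]
low_centroid, high_centroid = (spectral_centroid(audio) for _, audio in clips)

print("labels identical:", labels[0] == labels[1])
print("centroids (Hz):", round(low_centroid), round(high_centroid))
```

The label alone cannot tell the two clips apart, while even a crude acoustic measurement can. A reference clip carries that distinguishing information directly, which is the intuition behind conditioning on audio rather than text.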

AC-Foley's audio-conditioned framework supports capabilities that are difficult to achieve with text control. It allows fine-grained sound synthesis, in which the texture and timbre of a reference sound are transferred to a new visual context, and it facilitates zero-shot sound generation for actions or objects not explicitly seen during training. Empirically, the model sets a new state of the art for Foley generation when an audio reference is provided, and it remains competitive with leading text-guided V2A methods even when operating without an audio condition. The work, accepted at ICLR 2026, includes a public demo, so the audio-based control can be tested in film, game development, and content creation pipelines.

Key Points
  • Uses reference audio clips instead of text for precise control, sidestepping the ambiguity of text descriptions.
  • Enables fine-grained sound synthesis, timbre transfer, and zero-shot generation for unseen actions or objects.
  • Achieves state-of-the-art Foley generation with audio guidance and remains competitive without it.

Why It Matters

Lets filmmakers and game developers specify a target sound by example rather than by description, giving them closer control over the timbre and texture of generated sound effects.