Audio & Speech

CVPR 2026 paper reveals V2A models fail physical reasoning tests

FlatSounds benchmark shows that captions help physics but hurt timing.

Deep Dive

A team of researchers from NVIDIA, UC Berkeley, and other institutions has unveiled FlatSounds, a benchmark designed to audit the physical reasoning of generative video-to-audio (V2A) models. Unlike existing evaluations that focus on perceptual realism, FlatSounds uses controlled counterfactual pairs—where a single physical factor (e.g., object mass, surface material, or impact speed) is varied—and single-video pattern tests to probe internal consistency and directional trends. The goal is to determine whether generated audio correctly reflects underlying physical properties and timings.
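The counterfactual-pair idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the paper's actual metric: the helper names (`rms_energy`, `directional_check`, `synth_impact`), the use of RMS loudness as the audio feature, and the expectation that a faster impact should sound louder. The two synthetic waveforms stand in for the audio a V2A model would generate for the two videos of a pair.

```python
import math

def rms_energy(samples):
    """Root-mean-square amplitude of a mono waveform (list of floats)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def directional_check(audio_low, audio_high, feature=rms_energy,
                      expect_increase=True, margin=0.0):
    """Check one counterfactual pair.

    audio_low / audio_high: audio generated for the two videos that differ
    in a single physical factor (hypothetical example: impact speed).
    Returns True if the chosen feature moves in the expected direction
    by more than `margin`.
    """
    delta = feature(audio_high) - feature(audio_low)
    return delta > margin if expect_increase else delta < -margin

def synth_impact(amplitude, n=800, decay=0.01, freq=0.05):
    """Toy stand-in for model output: an exponentially decaying sinusoid."""
    return [amplitude * math.exp(-decay * t) * math.sin(2 * math.pi * freq * t)
            for t in range(n)]

# Illustrative pair: same scene, only the (assumed) impact speed differs.
slow_impact_audio = synth_impact(amplitude=0.2)
fast_impact_audio = synth_impact(amplitude=0.8)

print(directional_check(slow_impact_audio, fast_impact_audio))  # True
```

Aggregating such pass/fail checks over many pairs would give a directional accuracy score per physical factor; a real benchmark would likely use richer acoustic features than loudness alone.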

Evaluating state-of-the-art V2A models, the authors found that models lean heavily on text captions rather than the visual stream to infer physics and semantics, which produces a consistent trade-off. Captions generally improve physical and semantic accuracy, but they degrade temporal alignment, that is, how precisely audio events line up with the corresponding visual events. The findings suggest that current architectures favor caption-driven shortcuts over learning physical processes directly from pixels. The FlatSounds metrics also correlate strongly with human preference judgments, suggesting they capture qualities that listeners notice. The paper, accepted at CVPR 2026, calls for V2A research to shift toward pixel-based physical understanding.

Key Points
  • FlatSounds introduces counterfactual pairs varying single physical factors (e.g., mass, material) to test V2A physics understanding.
  • Current V2A models rely more on text captions than on visual input for physics, creating a trade-off: better physical and semantic accuracy but worse temporal alignment.
  • Physics-based metrics from FlatSounds strongly correlate with human preference tests on the benchmark's own data.

Why It Matters

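For AI-generated video soundtracks, this work shows that captions are not a free shortcut to physically accurate audio: they improve physical plausibility but cost temporal alignment, so models still need to learn physics from the pixels.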