Research & Papers

LongAV-Compass benchmark evaluates minute-long audio-video generation across 3 input modalities

284 test cases and 20+ metrics reveal where current systems fail at long-form AI video.

Deep Dive

Researchers introduce LongAV-Compass, a benchmark for minute-scale audio-visual generation. It covers three tasks, text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), with 284 curated test cases. The evaluation framework combines scoring assisted by multimodal large language models (MLLMs) with embedding-based metrics built on models such as DINO-v2 and ArcFace, spanning 20+ dimensions that range from segment quality to narrative coherence. The benchmark includes experiments on 11 representative models together with human-alignment validation. The result is a diagnostic testbed for analyzing where systems fail to sustain coherent, semantically aligned, and temporally consistent generation at minute scale across diverse input modalities.

Key Points
  • 284 curated test cases across T2AV, I2AV, and V2AV modalities, categorized by scenario and complexity.
  • Unified evaluation using MLLM assistance plus DINO-v2, ArcFace, CLIP, and ImageBind across 20+ dimensions.
  • Tests on 11 models expose degradation in identity, narrative coherence, and synchronization over minute-long clips.
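
To make the idea of identity degradation over a long clip concrete, here is a minimal sketch of one way such drift could be measured. It is not the benchmark's actual implementation. The function names and toy vectors are hypothetical, and it assumes each segment of a clip has already been reduced to one embedding vector by a model such as DINO-v2 or ArcFace.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def similarity_to_first_segment(segment_embeddings):
    """Compare every segment's embedding with the first segment's.

    Hypothetical helper: the first segment serves as the identity reference,
    so values falling over time suggest the subject is drifting.
    """
    reference = segment_embeddings[0]
    return [cosine(reference, emb) for emb in segment_embeddings]


def identity_drift(similarities):
    """Drop in similarity between the first and last segment."""
    return similarities[0] - similarities[-1]


# Toy 2-D vectors standing in for real embeddings (assumption, not real data).
toy_segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
sims = similarity_to_first_segment(toy_segments)
print(sims, identity_drift(sims))
```

A larger drift value means the final segment looks less like the opening one. A real pipeline would more likely average over many frames and subjects than rely on a single vector per segment.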

Why It Matters

Enables systematic evaluation of long-form AI video for applications in film, streaming, and immersive media.