LongAV-Compass benchmark evaluates minute-long audio-video generation across 3 modalities
284 test cases and 20+ metrics reveal where current systems fail at long-form AI video.
Researchers introduce LongAV-Compass, a benchmark for minute-scale audio-visual generation. It covers three input settings, text-to-audio-video (T2AV), image-to-audio-video (I2AV), and video-to-audio-video (V2AV), with 284 curated test cases. The evaluation framework combines MLLM-assisted scoring with metrics based on models such as DINO-v2 and ArcFace across 20+ dimensions, ranging from segment quality to narrative coherence. Experiments on 11 representative models, together with human-alignment validation, make the benchmark a diagnostic testbed for analyzing where systems fail to sustain coherent, semantically aligned, and temporally consistent minute-scale generation across diverse input modalities.
- 284 curated test cases across T2AV, I2AV, and V2AV modalities, categorized by scenario and complexity.
- Unified evaluation using MLLM assistance plus DINO-v2, ArcFace, CLIP, and ImageBind across 20+ dimensions.
- Tests on 11 models expose degradation in identity, narrative coherence, and synchronization over minute-long clips.
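To make the idea of identity degradation over a long clip concrete, here is a minimal sketch of one way an embedding-based consistency score could be computed. It is an illustration under stated assumptions, not the benchmark's actual scoring procedure: the function names `cosine` and `consistency_drift` are hypothetical, and the short toy vectors stand in for per-segment embeddings that a feature extractor such as DINO-v2 or ArcFace would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def consistency_drift(segment_embeddings):
    """Compare every later segment against the first one.

    Returns (mean similarity, list of per-segment similarities).
    A falling similarity over time suggests the subject is drifting.
    Hypothetical helper; not taken from LongAV-Compass.
    """
    if len(segment_embeddings) < 2:
        raise ValueError("need at least two segments to measure drift")
    reference = segment_embeddings[0]
    sims = [cosine(reference, emb) for emb in segment_embeddings[1:]]
    return sum(sims) / len(sims), sims


# Toy embeddings (assumed): segment 2 matches the reference, segment 3 does not.
toy_segments = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
mean_similarity, per_segment = consistency_drift(toy_segments)
print(mean_similarity, per_segment)
```

Comparing against the first segment rather than the previous one is a design choice in this sketch: it surfaces slow cumulative drift that adjacent-segment comparisons can hide.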
Why It Matters
LongAV-Compass enables systematic evaluation of long-form AI video, which is relevant to applications in film, streaming, and immersive media.