Artifact-Bench tests 19 MLLMs on AI video flaws – most fail

arXiv cs.CV May 20, 2026

Many models performed near or below random chance at detecting AI video artifacts.

Deep Dive

Recent advances in video generation have made AI-produced content increasingly realistic, but artifacts such as temporal glitches, structural distortions, and semantic inconsistencies remain common. How well Multimodal Large Language Models (MLLMs) can perceive and reason about such flaws has remained largely unclear. To address this, a team of 24 researchers led by Yuqi Tang introduced Artifact-Bench, a benchmark designed to systematically evaluate MLLMs on artifact detection and analysis across diverse video domains: not just photorealistic content but also animated and computer-generated (CG) styles.

The benchmark defines a three-level hierarchical taxonomy of realism artifacts and structures evaluation around three complementary tasks: binary classification (real vs. AI-generated video), pairwise realism comparison, and fine-grained artifact identification. Experiments with 19 state-of-the-art MLLMs revealed that many models performed near or even below random chance on these tasks, especially for non-photorealistic videos. The study also found a significant misalignment between MLLM judgments and human perceptual preferences, indicating that current models are not reliable as general-purpose evaluators of AI video realism. These results point to a critical gap in the ability of even advanced MLLMs to serve as quality assurance tools for AI-generated video content.
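To make "near or below random chance" concrete, here is a minimal Python sketch of how a model's score on a two-option task (real vs. AI, or picking the more realistic video of a pair) could be compared against the 50% chance baseline. It is an illustration only, not the authors' evaluation code: the function names, the toy labels, and the choice of a one-sided exact binomial test are all assumptions. The chance level for fine-grained artifact identification depends on the number of answer options, which the summary does not state, so it is left as a parameter.

```python
from math import comb


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels) or not labels:
        raise ValueError("predictions and labels must be non-empty and equal in length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def p_value_above_chance(correct, total, chance=0.5):
    """One-sided exact binomial tail: P(X >= correct) if the model were guessing.

    `chance` is 0.5 for a balanced two-option task (an assumption here);
    for a k-way multiple-choice task it would be 1 / k.
    """
    return sum(
        comb(total, k) * chance**k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


# Hypothetical toy data, not taken from the benchmark.
labels = ["real", "ai", "ai", "real", "ai", "real", "ai", "real"]
predictions = ["real", "real", "ai", "real", "real", "real", "ai", "real"]

acc = accuracy(predictions, labels)
n_correct = round(acc * len(labels))
p_value = p_value_above_chance(n_correct, len(labels))

print(f"accuracy = {acc:.2f}, chance = 0.50, p-value vs. chance = {p_value:.3f}")
```

On such a small toy sample, even 6 correct answers out of 8 is not clearly distinguishable from guessing, which is why a comparison against chance is more informative with a significance test and a larger evaluation set than with raw accuracy alone.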

Key Points

Artifact-Bench evaluates 19 leading MLLMs across three tasks: real vs. AI classification, pairwise realism comparison, and fine-grained artifact identification.
The benchmark uses a three-level artifact taxonomy and covers photorealistic, animated, and CG-style videos, with many models performing near or below random on non-photorealistic content.
A significant misalignment was found between MLLM judgments and human perceptual preferences, limiting the models' reliability as evaluators of AI-generated video realism.

Why It Matters

As AI video tools proliferate, reliable artifact detection is essential for quality control and building user trust.
