CMTA: Leveraging Cross-Modal Temporal Artifacts for Generalizable AI-Generated Video Detection
AI-generated videos have a telltale sign in their text-video alignment patterns.
A new research paper introduces CMTA (Cross-Modal Temporal Artifacts), a framework designed to detect AI-generated videos by exploiting a previously overlooked fingerprint: the temporal stability of semantic alignment between visual content and text prompts. Unlike real videos, where the alignment between frames and descriptions naturally fluctuates, AI-generated videos exhibit unnaturally consistent semantic trajectories tied to their input prompts.
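To make the artifact concrete, here is a minimal sketch of an alignment trajectory: the per-frame cosine similarity between each frame and its caption under CLIP. The checkpoint name and the standard-deviation cue at the end are illustrative assumptions, not the paper's exact formulation; `frames` is assumed to be a list of PIL images and `captions` their frame-level (e.g., BLIP-generated) descriptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical checkpoint choice; any CLIP variant would do for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def alignment_trajectory(frames, captions):
    """Per-frame cosine similarity between each frame and its caption."""
    sims = []
    for frame, caption in zip(frames, captions):
        inputs = processor(text=[caption], images=frame,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        # CLIPModel returns L2-normalized projected embeddings, so the
        # dot product is the cosine similarity.
        sims.append((out.image_embeds * out.text_embeds).sum().item())
    return torch.tensor(sims)

# Real videos tend to fluctuate along this trajectory; AI-generated videos
# stay unnaturally flat around the prompt. A flatness cue (illustrative,
# not the paper's detector) could be as simple as:
# traj = alignment_trajectory(frames, captions)
# stability = traj.std()  # lower => more suspiciously stable
```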
To capture this artifact, the CMTA pipeline first generates frame-level captions with BLIP, then extracts visual and textual embeddings with CLIP. A coarse-grained temporal branch uses a GRU to characterize overall alignment fluctuations, while a fine-grained branch uses a Transformer encoder to capture intricate inter-frame variations. Features from both branches are fused for the final real-versus-synthetic classification.
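The sketch below shows one plausible shape for such a dual-branch model in PyTorch. The layer sizes, mean-pooling of Transformer outputs, and concatenation fusion are assumptions for illustration, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class DualBranchDetector(nn.Module):
    """Toy dual-branch temporal model over per-frame CLIP features."""

    def __init__(self, feat_dim=512, hidden=256, heads=4, layers=2):
        super().__init__()
        # Coarse branch: GRU summarizes overall alignment fluctuation.
        self.gru = nn.GRU(feat_dim, hidden, batch_first=True)
        # Fine branch: Transformer encoder models inter-frame variations.
        enc_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.fine_proj = nn.Linear(feat_dim, hidden)
        self.head = nn.Linear(2 * hidden, 2)  # real vs. generated

    def forward(self, x):  # x: (batch, frames, feat_dim)
        _, h = self.gru(x)                # h: (num_layers, batch, hidden)
        coarse = h[-1]                    # final hidden state as summary
        fine = self.fine_proj(self.transformer(x).mean(dim=1))
        return self.head(torch.cat([coarse, fine], dim=-1))

# model = DualBranchDetector()
# logits = model(torch.randn(8, 16, 512))  # 8 videos, 16 frames each
```

Here the GRU's final hidden state acts as the coarse summary of the whole trajectory, while the pooled Transformer outputs carry finer frame-to-frame structure; the two are concatenated before the classification head.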
Extensive experiments across 40 subsets from four large-scale datasets—GenVideo, EvalCrafter, VideoPhy, and VidProM—demonstrate that CMTA achieves state-of-the-art detection accuracy and generalizes effectively to unseen video generators. This cross-generator robustness is critical as AI video synthesis tools multiply rapidly. The authors plan to release code and models publicly, enabling broader adoption for digital forensics and content moderation.
- CMTA identifies a new fingerprint: cross-modal temporal artifacts (unnaturally stable text-video alignment) in AI videos.
- Uses BLIP for caption generation, CLIP for embeddings, and a dual-branch architecture (GRU + Transformer) for temporal modeling.
- Achieves state-of-the-art results across 40 subsets from four large-scale datasets, with strong cross-generator generalization.
Why It Matters
As AI video generation tools become widely accessible, CMTA offers a robust, generalizable method for authenticating digital video content.