New Deepfake Detection Method Achieves 0.905 AUC by Indexing Facial Regions
Classifying only region-specific tokens, such as the mouth, raises AUC by 16.9 percentage points over Xception.
Traditional deepfake detectors often pool all facial features into a single representation, which can dilute manipulation cues and make explanations opaque. Al-Zyoud and El Saddik flip this design: they first use a frozen FaRL parser to group DINOv3 facial patch tokens into semantic regions (e.g., mouth, eyes, nose), then classify using only the tokens from the relevant region. Because DINOv3 patch tokens stay spatially consistent, the tokens selected for a region give the classifier a cleaner subspace in which to find manipulation evidence. The method requires no fine-tuning of DINOv3 or FaRL and no target-domain data. The only trained component is a linear probe on region-specific tokens.
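The region-indexing step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: random arrays stand in for DINOv3 patch tokens and for the FaRL label map, and the label id `MOUTH`, the majority-vote mapping from pixels to patches, the mean pooling, and the helper names are all assumptions made for the example.

```python
import numpy as np

PATCH = 16   # patch size implied by "ViT-L/16"
MOUTH = 1    # hypothetical label id; a real face parser uses its own ids


def patch_labels(label_map: np.ndarray, patch: int = PATCH) -> np.ndarray:
    """Reduce a per-pixel label map (H, W) to one label per patch.

    Assumption: each patch takes the majority label of its pixels, and
    patches are ordered row-major, matching the token order.
    """
    h, w = label_map.shape
    gh, gw = h // patch, w // patch
    blocks = (
        label_map[: gh * patch, : gw * patch]
        .reshape(gh, patch, gw, patch)
        .transpose(0, 2, 1, 3)              # -> (gh, gw, patch, patch)
        .reshape(gh * gw, patch * patch)
    )
    return np.array([np.bincount(b).argmax() for b in blocks])


def region_feature(tokens: np.ndarray, label_map: np.ndarray, region: int):
    """Keep only tokens whose patch carries `region`, then mean-pool them.

    Returns None when the region is absent from the label map.
    """
    labels = patch_labels(label_map)
    selected = tokens[labels == region]
    if selected.shape[0] == 0:
        return None
    return selected.mean(axis=0)


def linear_probe_score(feature: np.ndarray, weight: np.ndarray, bias: float) -> float:
    """Logistic linear probe: the only trainable part in this sketch."""
    return float(1.0 / (1.0 + np.exp(-(feature @ weight + bias))))


# Toy inputs: a 224x224 face crop gives a 14x14 grid of 196 patch tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 1024))            # stand-in for frozen patch tokens
label_map = np.zeros((224, 224), dtype=np.int64)
label_map[144:176, 80:144] = MOUTH               # stand-in for the parser's mouth mask

feature = region_feature(tokens, label_map, MOUTH)
weight = rng.normal(size=1024) * 0.01            # untrained placeholder weights
print("mouth patches:", int((patch_labels(label_map) == MOUTH).sum()))
print("fake score:", linear_probe_score(feature, weight, 0.0))
```

Because the probe only ever sees tokens selected by the region mask, any score it produces is attributable to that region by construction.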
On the challenging Celeb-DF v2 dataset, the mouth-indexed probe achieves an AUC of 0.905, a gain of 8.1 percentage points over LipForensics and 16.9 percentage points over Xception. Ablation studies indicate that DINOv3 representations and spatial indexing are each necessary: dropping regional selection lowers AUC by 26.4 percentage points, and replacing DINOv3 with FaRL features lowers it by 20.9 percentage points. The system is also explainable by construction: when the mouth model predicts fake, the decision rests solely on mouth tokens, so the attribution is built in rather than recovered from a post-hoc saliency map. The result is a practical, generalizable, and transparent approach to real-world deepfake detection.
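The gaps quoted above can be turned back into approximate AUC values with simple subtraction. The sketch below assumes that each percentage-point gap is measured on a 0 to 100 AUC scale relative to the 0.905 mouth-probe result; the outputs are implied by the figures in this article, not numbers reported separately, and the dictionary labels are descriptive names chosen for the example.

```python
MOUTH_AUC = 0.905  # mouth-indexed probe on Celeb-DF v2, as stated above

# Gaps in percentage points, as stated above.
gaps_pp = {
    "LipForensics": 8.1,
    "Xception": 16.9,
    "ablation: no regional selection": 26.4,
    "ablation: FaRL features instead of DINOv3": 20.9,
}

# Assumption: 1 percentage point equals 0.01 AUC.
implied_auc = {name: round(MOUTH_AUC - pp / 100, 3) for name, pp in gaps_pp.items()}

for name, auc in implied_auc.items():
    print(f"{name}: implied AUC {auc:.3f}")
```

Under that assumption, the implied values are 0.824 for LipForensics, 0.736 for Xception, 0.641 without regional selection, and 0.696 with FaRL features.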
- Uses a frozen FaRL parser to assign semantic labels (mouth, eyes, etc.) to DINOv3 ViT-L/16 patch tokens, discarding non-target regions before classification.
- Achieves AUC 0.905 on Celeb-DF v2, outperforming LipForensics by 8.1 percentage points and Xception by 16.9 percentage points without any fine-tuning or target-domain data.
- Explainable by design: predictions are attributed to specific facial regions (e.g., mouth only), eliminating the need for opaque saliency maps.
Why It Matters
Enables more accurate and transparent deepfake detection for media forensics, without fine-tuning either backbone or collecting target-domain data.