Research & Papers

Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models

94% ROC-AUC on UCF-Crime while explaining why something is anomalous.

Deep Dive

Traditional video anomaly detection (VAD) treats the task as binary classification or outlier detection, offering neither an explanation of why something is anomalous nor precise spatial localization. Researchers have now introduced VANGUARD (Video Anomaly Understanding through Reasoning and Grounding), a framework that combines anomaly classification, spatial grounding, and chain-of-thought reasoning in a single vision-language model. The system is trained with a three-stage curriculum: first, a classifier warmup on frozen backbone features; second, LoRA-adapted spatial grounding; and third, chain-of-thought generation. To overcome sparse annotations in VAD benchmarks, the authors employ a teacher-student pipeline in which Qwen3-VL-4B generates structured reasoning trajectories from UCA dataset annotations and GroundingDINO provides bounding box supervision.
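
To make the staged schedule concrete, here is a minimal sketch of the curriculum as a table of which parameter groups are updated at each stage. The group names and the exact freeze/unfreeze choices in stages two and three are assumptions made for illustration; the summary only states that stage one trains a classifier on frozen backbone features and that stage two uses LoRA adapters.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names; the real module layout is not given in the summary.
PARAM_GROUPS = ("backbone", "classifier_head", "lora_adapters")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # parameter groups updated during this stage
    objective: str


# Assumption: the backbone weights stay frozen throughout, and stage three
# keeps training the same LoRA adapters that stage two introduced.
CURRICULUM = (
    Stage("classifier_warmup", frozenset({"classifier_head"}), "anomaly classification"),
    Stage("spatial_grounding", frozenset({"lora_adapters"}), "bounding-box grounding"),
    Stage("cot_generation", frozenset({"lora_adapters"}), "chain-of-thought generation"),
)


def trainable_flags(stage):
    """Map every parameter group to True (updated) or False (frozen) for one stage."""
    return {group: group in stage.trainable for group in PARAM_GROUPS}


for stage in CURRICULUM:
    print(stage.name, stage.objective, trainable_flags(stage))
```

In a real training loop, flags like these would typically decide which parameters receive gradients in each stage; running the stages in sequence rather than optimizing every objective at once is what distinguishes the curriculum from monolithic training.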

On the UCF-Crime benchmark, VANGUARD achieves 94% ROC-AUC with 84% F1 while simultaneously outputting interpretable chain-of-thought explanations and bounding boxes for anomalous objects, capabilities absent from prior methods. Ablations confirm that staged training outperforms monolithic optimization, and that structured reasoning acts as an implicit regularizer yielding more balanced predictions. Zero-shot transfer to XD-Violence and ShanghaiTech demonstrates cross-domain generalization without target-domain adaptation. This work, currently under review, represents a shift from black-box anomaly detection to explainable, grounded video understanding.
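
For readers less familiar with the two headline metrics, the sketch below computes ROC-AUC and F1 from anomaly scores in plain Python. The toy labels and scores, the 0.5 decision threshold, and the function names are illustrative assumptions; the summary does not say whether the reported numbers are frame-level or video-level, or which threshold was used for F1.

```python
def roc_auc(labels, scores):
    """Probability that a random anomalous item outscores a random normal one (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(labels, scores, threshold=0.5):
    """Harmonic mean of precision and recall after thresholding the scores."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (made up for illustration): 1 = anomalous, 0 = normal.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

print(f"ROC-AUC: {roc_auc(labels, scores):.3f}")
print(f"F1 at threshold 0.5: {f1_score(labels, scores):.3f}")
```

ROC-AUC is threshold-free and measures ranking quality, while F1 depends on a chosen cut-off, which is why the two numbers are reported together.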

Key Points
  • 94% ROC-AUC and 84% F1 on UCF-Crime, together with explanations and bounding boxes that prior binary-classification VAD methods do not produce.
  • Three-stage curriculum: classifier warmup → LoRA spatial grounding → chain-of-thought generation.
  • Teacher-student pipeline using Qwen3-VL-4B for reasoning trajectories and GroundingDINO for bounding box supervision.
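
The teacher-student pipeline in the last point can be pictured as merging two sources of pseudo-labels into one training record. The sketch below is only an assumption about the data flow: the record fields, the box format, and the stub functions are hypothetical, and the stubs stand in for the teacher model and the detector rather than calling Qwen3-VL-4B or GroundingDINO.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


@dataclass
class TrainingSample:
    video_id: str
    label: int          # 1 = anomalous, 0 = normal
    reasoning: str      # structured reasoning text produced by the teacher
    boxes: List[Box] = field(default_factory=list)


def build_sample(
    video_id: str,
    label: int,
    annotation: str,
    reasoner: Callable[[str], str],
    detector: Callable[[str], List[Box]],
) -> TrainingSample:
    """Combine teacher reasoning and detector boxes into one supervised record."""
    return TrainingSample(
        video_id=video_id,
        label=label,
        reasoning=reasoner(annotation),
        boxes=detector(video_id),
    )


# Stubs standing in for the teacher model and the grounding detector (hypothetical).
def stub_reasoner(annotation: str) -> str:
    return f"Observation: {annotation} Conclusion: anomalous activity."


def stub_detector(video_id: str) -> List[Box]:
    return [(10.0, 20.0, 110.0, 220.0)]


sample = build_sample(
    "video_0001", 1, "A person breaks a car window.", stub_reasoner, stub_detector
)
print(sample)
```

Passing the teacher and detector in as callables keeps the record-building step independent of any particular model, so either source of supervision could be swapped without changing the sample format.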

Why It Matters

Enables explainable, grounded video anomaly detection with cross-domain generalization, moving beyond black-box binary classifiers.