Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision
A new benchmark shows that small vision-language models, once efficiently adapted, can match or beat established approaches at spotting surveillance anomalies while keeping per-clip latency competitive.
A research team has published, in MDPI's Journal of Imaging, a benchmark of compact Vision-Language Models (VLMs) for detecting anomalies in surveillance footage. The study, "Benchmarking Compact VLMs for Clip-Level Surveillance Anomaly Detection Under Weak Supervision," addresses a practical need: systems that are both accurate and fast enough for real-time CCTV monitoring, even when trained on limited, weakly labeled data. The researchers established a unified evaluation protocol to compare parameter-efficiently adapted compact VLMs on equal terms against training-free VLM pipelines and other weakly supervised baselines.
The key finding is that, with parameter-efficient fine-tuning (PEFT), these smaller models reach detection quality (measured by metrics such as F1 score and ROC-AUC) that matches or surpasses the established approaches. They do so while retaining competitive average per-clip latency, which keeps them practical for deployment. Adaptation also made the models less sensitive to variations in text prompts, so performance was more consistent across prompt wordings. Together, the protocol and results provide transparent evidence that compact VLMs offer a favorable accuracy-efficiency trade-off, pointing toward more cost-effective and deployable AI for security and safety monitoring.
- Compact VLMs adapted with PEFT performed on par with or better than established approaches (training-free VLM pipelines and weakly supervised baselines) for anomaly detection.
- The study's unified protocol standardized evaluation across accuracy, recall, F1, ROC-AUC, and average per-clip latency.
- Parameter-efficient adaptation reduced prompt sensitivity, yielding more consistent model behavior suitable for real-world, weakly supervised settings.
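
To make the reported metrics concrete, here is a minimal Python sketch of how clip-level accuracy, recall, F1, ROC-AUC, and average per-clip latency could be computed. It is an illustration only, not the study's evaluation code: the function names, the 0.5 decision threshold, and the stand-in scorer are all assumptions introduced for this example.

```python
import time
from statistics import mean


def clip_metrics(y_true, scores, threshold=0.5):
    """Accuracy, recall and F1 for binary clip labels (1 = anomalous).

    The 0.5 threshold is an assumption for this sketch, not a value
    taken from the study.
    """
    y_pred = [1 if s >= threshold else 0 for s in scores]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "recall": recall, "f1": f1}


def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random anomalous clip outscores
    a random normal clip, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one clip of each class")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def score_with_latency(score_clip, clips):
    """Score every clip and return (scores, mean seconds per clip)."""
    scores, latencies = [], []
    for clip in clips:
        start = time.perf_counter()
        scores.append(score_clip(clip))
        latencies.append(time.perf_counter() - start)
    return scores, mean(latencies)


if __name__ == "__main__":
    # Hypothetical stand-in for a model: each "clip" is already a score.
    labels = [0, 0, 1, 1]
    clips = [0.1, 0.4, 0.35, 0.8]
    scores, avg_latency = score_with_latency(lambda clip: clip, clips)
    print(clip_metrics(labels, scores))
    print("ROC-AUC:", roc_auc(labels, scores))
    print("mean per-clip latency (s):", avg_latency)
```

Reporting a threshold-free metric (ROC-AUC) next to thresholded ones (accuracy, recall, F1) and a timing figure is what lets a protocol of this kind compare detection quality and deployability side by side.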
Why It Matters
Compact, efficiently adapted VLMs could make AI surveillance systems cheaper, faster, and more consistent, while requiring only limited, weakly labeled training data.