Research & Papers

A Modular Zero-Shot Pipeline for Accident Detection, Localization, and Classification in Traffic Surveillance Video

A modular system uses CLIP and optical flow to pinpoint when, where, and what type of crash occurs in surveillance video.

Deep Dive

Researchers Amey Thakur and Sarvesh Talele have unveiled a modular AI pipeline designed to automatically detect, localize, and classify traffic accidents in surveillance footage without any labeled real-world training data. Developed for the ACCIDENT @ CVPR 2026 challenge, the system tackles the problem of understanding "when, where, and what" in accident videos with a zero-shot approach that relies solely on pre-trained models, eliminating the need for costly and time-consuming data annotation.

The pipeline decomposes the task into three specialized modules. First, it localizes the collision in time by applying peak detection to z-score normalized frame-difference signals, identifying the moment of highest activity. Second, it pinpoints the impact location by computing the weighted centroid of cumulative dense optical flow magnitude maps generated by the Farneback algorithm. Finally, it classifies the accident type by measuring the cosine similarity between CLIP image embeddings of key frames and text embeddings built from multi-prompt natural language descriptions of collision categories.
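
The first two modules need nothing beyond standard OpenCV, NumPy, and SciPy. The sketch below illustrates the idea under assumed settings: the z-score peak threshold and the Farneback parameters are placeholder values rather than the authors' reported configuration, and `frames` is assumed to be a list of BGR frames as returned by OpenCV.

```python
import cv2
import numpy as np
from scipy.signal import find_peaks

def frame_difference_signal(frames):
    """Mean absolute gray-level difference between consecutive frames."""
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY).astype(np.float32) for f in frames]
    return np.array([np.abs(g1 - g0).mean() for g0, g1 in zip(grays, grays[1:])])

def localize_in_time(frames, z_threshold=2.0):
    """Frame index of the strongest activity peak in the z-scored difference signal.

    The threshold of 2.0 standard deviations is an assumed value for illustration.
    """
    sig = frame_difference_signal(frames)
    z = (sig - sig.mean()) / (sig.std() + 1e-8)        # z-score normalization
    peaks, props = find_peaks(z, height=z_threshold)
    if len(peaks) == 0:
        return int(np.argmax(z)) + 1                   # fall back to the global maximum
    # diff[i] compares frame i and i+1, so a peak at i maps to frame i+1.
    return int(peaks[np.argmax(props["peak_heights"])]) + 1

def localize_in_space(frames):
    """Weighted centroid of cumulative dense optical-flow magnitude (Farneback)."""
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    h, w = prev.shape
    acc = np.zeros((h, w), np.float32)
    for f in frames[1:]:
        gray = cv2.cvtColor(f, cv2.COLOR_BGR2GRAY)
        # Farneback parameters below are OpenCV-typical defaults, not the paper's.
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        acc += np.linalg.norm(flow, axis=2)            # accumulate flow magnitude
        prev = gray
    ys, xs = np.mgrid[0:h, 0:w]
    total = acc.sum() + 1e-8
    return (xs * acc).sum() / total, (ys * acc).sum() / total  # (x, y) centroid
```

Because the flow magnitude is accumulated over the whole clip, regions with sustained motion dominate the weighted average, which is why the centroid serves as a reasonable proxy for the impact point.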

This architecture is significant because it is entirely zero-shot; no domain-specific fine-tuning is involved. The system processes each video using only off-the-shelf components: pre-trained CLIP weights and classical optical-flow computation. The researchers have made their implementation publicly available as a Kaggle notebook, providing a practical tool for traffic monitoring and safety analysis that can be deployed without gathering accident-specific training datasets.
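
For the third module, zero-shot classification with CLIP reduces to a few embedding and cosine-similarity calls. The sketch below uses the Hugging Face transformers port; the checkpoint name, the category labels, and the prompt wordings are illustrative assumptions, since the paper's actual prompt set and CLIP variant are not reproduced here.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Hypothetical categories and prompts -- the paper's multi-prompt wording is not given here.
PROMPTS = {
    "rear-end collision": [
        "a photo of a car crashing into the back of another car",
        "surveillance footage of a rear-end traffic collision",
    ],
    "side-impact collision": [
        "a photo of a vehicle striking the side of another vehicle",
        "surveillance footage of a T-bone traffic collision",
    ],
}

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def class_text_embeddings():
    """One unit-norm text embedding per class, averaged over its prompts."""
    embs = {}
    for label, prompts in PROMPTS.items():
        tokens = processor(text=prompts, return_tensors="pt", padding=True)
        e = model.get_text_features(**tokens)
        e = e / e.norm(dim=-1, keepdim=True)   # normalize each prompt embedding
        mean = e.mean(dim=0)                   # multi-prompt ensemble per class
        embs[label] = mean / mean.norm()
    return embs

@torch.no_grad()
def classify_key_frames(key_frames, text_embs):
    """Rank classes by cosine similarity to the averaged key-frame embedding.

    key_frames: RGB PIL images (convert BGR OpenCV frames first).
    """
    inputs = processor(images=key_frames, return_tensors="pt")
    img = model.get_image_features(**inputs)
    img = img / img.norm(dim=-1, keepdim=True)
    mean = img.mean(dim=0)
    mean = mean / mean.norm()
    scores = {label: float(mean @ e) for label, e in text_embs.items()}
    return max(scores, key=scores.get), scores
```

Averaging several prompt embeddings per class (prompt ensembling) typically makes zero-shot CLIP classification more robust to phrasing than relying on a single template.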

Key Points
  • Uses a three-module zero-shot pipeline requiring no labeled accident data for training.
  • Locates accidents in time via frame-difference peak detection and in space via Farneback optical flow centroids.
  • Classifies accident type by matching CLIP image embeddings of key frames against text-prompt embeddings of collision categories.

Why It Matters

Enables rapid deployment of automated traffic safety monitoring in cities without the prohibitive cost of collecting and labeling accident video datasets.