COPRA uses RL to adapt VLMs per video segment for anomaly detection
VLMs adapt dynamically using reinforcement learning, outperforming static baselines in video anomaly detection.
Current vision-language models (VLMs) for video anomaly detection (VAD) suffer from a fundamental mismatch between training and inference. They are typically adapted with static post-training methods, and they are trained on sparse frames but tested on dense segments. This limits generalization under distribution shifts such as unseen environments or anomaly types. To address this, researchers from multiple institutions introduce COPRA, a conditional parameter adaptation framework that uses reinforcement learning (RL) to generate input-specific parameter updates for a frozen VLM on each video segment. Instead of sharing one set of prompts or parameter updates across all inputs, COPRA generates a separate update for each input while the base weights stay frozen, so the model is adapted the same way during training and inference.
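To make the idea of per-segment parameter updates concrete, here is a minimal sketch in plain Python. It is not COPRA's implementation: the summary does not say how the updates are parameterized or how the RL objective is defined. The rank-1 update, the `ConditionalAdapter` class, and the `adapted_forward` helper are all assumptions chosen for illustration, and the RL training of the generator is left out entirely.

```python
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


class ConditionalAdapter:
    """Hypothetical generator: maps a segment feature to a rank-1 weight update.

    Assumption for illustration only; the real generator and its
    RL-based training are not described in this summary.
    """

    def __init__(self, feat_dim, out_dim, in_dim, seed=0):
        rng = random.Random(seed)
        # Two small projections produce the factors u and v of the update.
        self.A = [[rng.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(out_dim)]
        self.B = [[rng.gauss(0, 0.1) for _ in range(feat_dim)] for _ in range(in_dim)]

    def delta(self, segment_feat):
        u = matvec(self.A, segment_feat)  # length out_dim
        v = matvec(self.B, segment_feat)  # length in_dim
        # Outer product u v^T gives an (out_dim x in_dim) update.
        return [[ui * vj for vj in v] for ui in u]


def adapted_forward(W_frozen, adapter, segment_feat, x):
    """Apply the frozen layer plus a segment-specific update.

    W_frozen is never modified; the effective weights are built per call.
    """
    dW = adapter.delta(segment_feat)
    W_eff = [[w + d for w, d in zip(w_row, d_row)] for w_row, d_row in zip(W_frozen, dW)]
    return matvec(W_eff, x)
```

Calling `adapted_forward` with two different segment features produces two different effective weight matrices from the same frozen `W_frozen`, which is the contrast with a static adapter that applies one shared update to every input.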
Experiments on standard VAD benchmarks show that COPRA consistently outperforms static baselines in both in-domain and cross-domain evaluations. Beyond anomaly detection, COPRA generalizes to unseen tasks such as multiple-choice Video Question Answering and Dense Captioning, which supports its use as a weight-space generation framework for scalable, adaptive video understanding. The code will be released to support further research. This work highlights a shift from one-size-fits-all VLM adaptation to context-aware, per-segment tuning, which could make video analytics more robust in real-world deployment.
- COPRA uses reinforcement learning to generate input-specific parameter updates per video segment for frozen VLMs
- Outperforms static baselines on standard VAD benchmarks in both in-domain and cross-domain settings
- Generalizes beyond anomaly detection to multiple-choice Video QA and Dense Captioning tasks
Why It Matters
Dynamic per-segment adaptation enables more accurate and generalizable video anomaly detection across diverse environments.