Hybrid cross-attention fusion model reaches 91.7% validation accuracy on audio-visual events
New framework combines VideoMAE and AST with bidirectional cross-attention for urban monitoring.
A new paper from researchers Parinaz Binandeh Dehaghani, Danilo Pena, and A. Pedro Aguiar introduces a "Stable Hybrid Cross-Attention Fusion" framework for Audio-Visual Event Recognition (AVER) designed specifically for intelligent urban monitoring systems. The architecture leverages pretrained Video Masked Autoencoders (VideoMAE) for video features and Audio Spectrogram Transformers (AST) for audio features, combined with FiLM-based audio conditioning to modulate video representations. A bidirectional cross-attention fusion module then exchanges information between modalities, followed by a multimodal Transformer encoder and a modality-temporal attention mechanism. To boost computational efficiency and training stability, the backbone models are frozen and cached feature extraction is used, eliminating the need for end-to-end fine-tuning.
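The two fusion steps described above, FiLM conditioning followed by bidirectional cross-attention, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. It assumes both modalities already share the same feature width, uses single-head attention with identity projections instead of learned ones, and mean-pools the audio tokens to predict the FiLM parameters. The function names and the weight matrices `w_gamma` and `w_beta` are hypothetical.

```python
import math

Vector = list[float]
Matrix = list[list[float]]  # one row per token (time step)


def matvec(w: Matrix, x: Vector) -> Vector:
    """Multiply matrix w by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def mean_pool(tokens: Matrix) -> Vector:
    """Average a token sequence into a single summary vector."""
    return [sum(col) / len(tokens) for col in zip(*tokens)]


def film(video: Matrix, audio: Matrix, w_gamma: Matrix, w_beta: Matrix) -> Matrix:
    """FiLM: scale and shift each video token with gamma/beta predicted from audio.

    Assumption: gamma and beta are linear maps of the mean-pooled audio tokens.
    """
    summary = mean_pool(audio)
    gamma = matvec(w_gamma, summary)
    beta = matvec(w_beta, summary)
    return [[g * v + b for g, v, b in zip(gamma, tok, beta)] for tok in video]


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries: Matrix, context: Matrix) -> Matrix:
    """Scaled dot-product attention: queries from one modality, keys/values from the other.

    Simplification: identity projections, so keys and values are the context tokens.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in context]
        weights = softmax(scores)
        out.append([sum(w * k[j] for w, k in zip(weights, context)) for j in range(d)])
    return out


def bidirectional_fusion(video: Matrix, audio: Matrix) -> tuple[Matrix, Matrix]:
    """Let each modality attend to the other, with a residual connection."""
    video_ctx = cross_attend(video, audio)  # video queries read from audio
    audio_ctx = cross_attend(audio, video)  # audio queries read from video
    video_out = [[x + y for x, y in zip(t, c)] for t, c in zip(video, video_ctx)]
    audio_out = [[x + y for x, y in zip(t, c)] for t, c in zip(audio, audio_ctx)]
    return video_out, audio_out


# Toy features standing in for cached backbone outputs (made-up numbers).
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 video tokens, width 2
audio = [[0.5, 0.5], [1.0, -1.0]]             # 2 audio tokens, width 2
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

modulated = film(video, audio, identity, zeros)
fused_video, fused_audio = bidirectional_fusion(modulated, audio)
```

In a real model the projections, FiLM weights, and multiple attention heads would be learned, and the fused sequences would then feed the multimodal Transformer encoder.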
Experiments on the standard AVE dataset support the design. The framework achieves a best validation accuracy of 91.74% and a test accuracy of 83.85% ± 1.40% over five independent runs, consistently outperforming both unimodal (video-only or audio-only) and multimodal baselines. The hybrid fusion strategy is reported to be robust in challenging real-world urban scenarios where background noise or occlusions may degrade single-modality performance, because it captures complementary information: for example, visual cues of a car crash combined with the sound of screeching tires. The paper is available on arXiv (2606.03747) and suggests that this approach could be integrated into smart city infrastructure for automated event detection and response.
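A figure such as "83.85% ± 1.40% over five runs" is typically a mean and standard deviation across independently seeded runs. The short sketch below shows one way to compute and format such a summary. The per-run accuracies are made-up placeholders, not the paper's numbers, and the use of the sample standard deviation (n − 1 denominator) is an assumption, since the summary does not say which convention the authors used.

```python
import statistics

# Hypothetical per-run test accuracies in percent (NOT the paper's results).
run_accuracies = [82.0, 83.0, 84.0, 85.0, 86.0]


def summarize(accuracies: list[float]) -> tuple[float, float]:
    """Return (mean, sample standard deviation) across runs."""
    mean = statistics.mean(accuracies)
    std = statistics.stdev(accuracies)  # assumption: n - 1 denominator
    return mean, std


def format_summary(accuracies: list[float]) -> str:
    """Format the result in the 'mean% ± std%' style used in the article."""
    mean, std = summarize(accuracies)
    return f"{mean:.2f}% ± {std:.2f}%"


print(format_summary(run_accuracies))
```

Swapping `statistics.stdev` for `statistics.pstdev` gives the population standard deviation, which is slightly smaller for the same runs.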
- Framework combines frozen VideoMAE and AST backbones with FiLM conditioning and bidirectional cross-attention fusion.
- Best validation accuracy of 91.74% and test accuracy of 83.85% ± 1.40% on the AVE dataset over five runs.
- Frozen pretrained backbones and cached feature extraction improve training stability and reduce compute requirements.
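
Because a frozen backbone always produces the same output for the same input, its features can be computed once and reused in every training epoch. The sketch below illustrates that caching idea with the standard library only. `frozen_backbone` is a hypothetical stand-in for a real pretrained encoder such as VideoMAE or AST, and the `FeatureCache` class, its on-disk pickle layout, and the clip identifiers are assumptions made for illustration, not details from the paper.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path
from typing import Callable


def frozen_backbone(clip: list[float]) -> list[float]:
    """Hypothetical stand-in for a frozen encoder: fixed weights, never updated."""
    return [sum(clip) / len(clip), max(clip), min(clip)]


class FeatureCache:
    """Compute features once per clip and reload them from disk afterwards."""

    def __init__(self, cache_dir: str, extractor: Callable[[list[float]], list[float]]):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        self.extractor = extractor
        self.extractor_calls = 0  # counts how often the backbone actually runs

    def _path(self, clip_id: str) -> Path:
        # Hash the id so any string maps to a safe file name.
        key = hashlib.sha256(clip_id.encode("utf-8")).hexdigest()
        return self.cache_dir / f"{key}.pkl"

    def get(self, clip_id: str, clip: list[float]) -> list[float]:
        path = self._path(clip_id)
        if path.exists():
            # Cache hit: skip the backbone entirely.
            with path.open("rb") as f:
                return pickle.load(f)
        # Cache miss: run the backbone once and store the result.
        self.extractor_calls += 1
        features = self.extractor(clip)
        with path.open("wb") as f:
            pickle.dump(features, f)
        return features


cache = FeatureCache(tempfile.mkdtemp(), frozen_backbone)
first = cache.get("clip_0001", [0.1, 0.4, 0.7])   # runs the backbone
second = cache.get("clip_0001", [0.1, 0.4, 0.7])  # served from disk
```

With this pattern only the lightweight fusion layers see gradient updates, while the expensive encoders run once per clip. The cache must be rebuilt if the backbone or the preprocessing changes.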
Why It Matters
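Points toward robust audio-visual event detection for smart city monitoring: fusing both modalities outperforms unimodal and multimodal baselines on AVE, while frozen backbones and cached features keep training cost low.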