Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection

Ayush Ghadiya, Purbayan Kar, Vishal Chudasama, Pankaj Wasnik; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1965-1974

Abstract


Recently, weakly supervised video anomaly detection (WS-VAD) has emerged as a research direction for identifying anomalous events such as violence and nudity in videos using only video-level labels. However, the task poses substantial challenges, including handling imbalanced modality information and consistently distinguishing between normal and abnormal features. In this paper, we address these challenges and propose a multi-modal WS-VAD framework to accurately detect anomalies such as violence and nudity. Within the proposed framework, we introduce a new fusion mechanism, the Cross-modal Fusion Adapter (CFA), which dynamically selects and enhances highly relevant audio-visual features in relation to the visual modality. Additionally, we introduce a Hyperbolic Lorentzian Graph Attention (HLGAtt) to effectively capture the hierarchical relationships between normal and abnormal representations, thereby enhancing feature separation accuracy. Through extensive experiments, we demonstrate that the proposed model achieves state-of-the-art results on benchmark datasets of violence and nudity detection.
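
For readers unfamiliar with the Lorentz (hyperboloid) model that the name HLGAtt refers to, the following is a minimal sketch of the standard Lorentzian inner product and geodesic distance at curvature -1. It is textbook hyperbolic geometry, not the authors' implementation: the function names `lorentz_inner`, `lift_to_hyperboloid`, and `lorentz_distance` are hypothetical, and the abstract does not state how HLGAtt uses such quantities inside its attention layer.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product: -x0*y0 + sum_i xi*yi."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift_to_hyperboloid(v):
    """Map a Euclidean vector v onto the unit hyperboloid (curvature -1)
    by prepending the time coordinate x0 = sqrt(1 + ||v||^2)."""
    x0 = math.sqrt(1.0 + sum(a * a for a in v))
    return [x0] + list(v)

def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L) between two hyperboloid points."""
    # Clamp at 1.0 so rounding error cannot push the argument outside acosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

# Illustrative values only: the hyperboloid origin and a point at distance 1 from it.
origin = lift_to_hyperboloid([0.0, 0.0])
point = lift_to_hyperboloid([math.sinh(1.0), 0.0])
d = lorentz_distance(origin, point)
```

Distances in this model grow rapidly away from the origin, which is the usual reason hyperbolic space is chosen for hierarchical structure.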

Related Material


@InProceedings{Ghadiya_2024_CVPR,
    author    = {Ghadiya, Ayush and Kar, Purbayan and Chudasama, Vishal and Wasnik, Pankaj},
    title     = {Cross-Modal Fusion and Attention Mechanism for Weakly Supervised Video Anomaly Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1965-1974}
}