@InProceedings{Gao_2024_ACCV,
  author    = {Gao, Shibo and Yang, Peipei and Huang, Linlin},
  title     = {Scene-Adaptive SVAD Based On Multi-modal Action-based Feature Extraction},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {2471-2488}
}
Scene-Adaptive SVAD Based On Multi-modal Action-based Feature Extraction
Abstract
Due to the lack of anomalous data, most existing semi-supervised video anomaly detection (SVAD) methods rely on designing self-supervised tasks to reconstruct video frames for learning normal patterns from training data, thereby distinguishing anomalous events from normal ones according to the reconstruction quality. However, these methods rely heavily on the frequency of an event's occurrence to judge its abnormality, often misidentifying rare normal events as anomalies. More importantly, they are usually trained to fit a particular scene, leading to poor generalization to other scenes. Besides, for all existing methods, the normal/abnormal events are fixed once training is finished and cannot be adjusted at test time without retraining the model. To resolve these problems, we propose a semi-supervised video anomaly detection method based on a multi-modal action-based feature extraction model. Our method exploits a vision-language model pre-trained on an action recognition task for action-based feature extraction, making it robust to scene variations irrelevant to anomalies. A clustering model with learnable prompts is employed for learning the normal patterns and detecting anomalies; it does not rely on event frequency and can correctly identify rare normal events. Benefiting from the multi-modal model, our method can conveniently adjust the set of normal events at test time through text guidance, without retraining. We conduct experiments on benchmark datasets, and the results demonstrate that our method achieves state-of-the-art performance. More importantly, our method exhibits clearly better performance in the cross-scene and test-time anomaly adjustment experiments.