Test-Time Zero-Shot Temporal Action Localization

Benedetta Liberatori, Alessandro Conti, Paolo Rota, Yiming Wang, Elisa Ricci; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18720-18729

Abstract


Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective training-based ZS-TAL approaches assume the availability of labeled data for supervised learning which can be impractical in some applications. Furthermore the training process naturally induces a domain bias into the learned model which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective relaxing the requirement for training data. To this aim we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T 3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs confirming the benefit of a test-time adaptation approach.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Liberatori_2024_CVPR, author = {Liberatori, Benedetta and Conti, Alessandro and Rota, Paolo and Wang, Yiming and Ricci, Elisa}, title = {Test-Time Zero-Shot Temporal Action Localization}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {18720-18729} }