ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection

Phan, Thinh; Vo, Khoa; Le, Duy; Doretto, Gianfranco; Adjeroh, Donald; Le, Ngan

Thinh Phan, Khoa Vo, Duy Le, Gianfranco Doretto, Donald Adjeroh, Ngan Le; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 7046-7055

Abstract

Temporal action detection (TAD) involves the localization and classification of action instances within untrimmed videos. While standard TAD follows fully supervised learning with a closed-set setting on large training data, recent zero-shot TAD methods showcase the promising open-set setting by leveraging large-scale contrastive visual-language (ViL) pretrained models. However, existing zero-shot TAD methods have limitations in how they construct a strong relationship between the two interdependent tasks of localization and classification and in how they adapt the ViL model to video understanding. In this work, we present ZEETAD, featuring two modules: dual-localization and zero-shot proposal classification. The former is a Transformer-based module that detects action events while selectively collecting crucial semantic embeddings for later recognition. The latter, a CLIP-based module, generates semantic embeddings from text and frame inputs for each temporal unit. Additionally, we enhance discriminative capability on unseen classes by minimally updating the frozen CLIP encoder with lightweight adapters. Extensive experiments on the THUMOS14 and ActivityNet-1.3 datasets demonstrate our approach's superior performance in zero-shot TAD and effective knowledge transfer from ViL models to unseen action categories. Code is available at https://github.com/UARK-AICV/ZEETAD.
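The zero-shot proposal classification idea can be illustrated with a minimal sketch. This is not the authors' implementation: it only shows the general CLIP-style recipe of scoring a proposal embedding against text embeddings of class names by cosine similarity. The function names, the toy 3-dimensional embeddings, and the temperature value are all assumptions made for illustration; in practice the embeddings would come from the pretrained encoders.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def zero_shot_classify(proposal_emb, class_text_embs, temperature=0.07):
    """Score one proposal embedding against per-class text embeddings.

    proposal_emb: list of floats (hypothetical pooled embedding of a proposal).
    class_text_embs: dict mapping class name -> list of floats.
    temperature: illustrative softmax temperature (an assumption).
    Returns (best_label, probabilities_by_label).
    """
    p = l2_normalize(proposal_emb)
    logits = {}
    for label, text_emb in class_text_embs.items():
        t = l2_normalize(text_emb)
        # Cosine similarity between unit vectors, scaled by the temperature.
        logits[label] = sum(a * b for a, b in zip(p, t)) / temperature

    # Numerically stable softmax over the class logits.
    max_logit = max(logits.values())
    exps = {k: math.exp(v - max_logit) for k, v in logits.items()}
    total = sum(exps.values())
    probs = {k: e / total for k, e in exps.items()}
    best = max(probs, key=probs.get)
    return best, probs


# Toy example: unseen class names only need a text embedding, not training data.
toy_classes = {"high jump": [1.0, 0.0, 0.0], "diving": [0.0, 1.0, 0.0]}
label, probs = zero_shot_classify([0.9, 0.1, 0.0], toy_classes)
```

Because the class set enters only through `class_text_embs`, adding a new action category in this sketch means adding one more text embedding, which is the property that makes the open-set setting possible.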

Related Material

[bibtex]

@InProceedings{Phan_2024_WACV,
    author    = {Phan, Thinh and Vo, Khoa and Le, Duy and Doretto, Gianfranco and Adjeroh, Donald and Le, Ngan},
    title     = {ZEETAD: Adapting Pretrained Vision-Language Model for Zero-Shot End-to-End Temporal Action Detection},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {7046-7055}
}