Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection

Sayak Nag, Orpaz Goldstein, Amit K. Roy-Chowdhury; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 6243-6253

Abstract


Zero-shot temporal activity detection (ZSTAD) is the problem of simultaneously localizing in time and classifying activity segments from classes that were not seen during training. This is achieved by transferring knowledge learned from semantically related seen activities. The ability to reason about unseen concepts without supervision makes ZSTAD very promising for applications where acquiring annotated training videos is difficult. In this paper, we design a transformer-based framework titled TranZAD, which streamlines the detection of unseen activities by casting ZSTAD as a direct set-prediction problem, removing the need for hand-crafted designs and manual post-processing. We show how a semantic information-guided contrastive learning strategy can effectively train TranZAD for the zero-shot setting, enabling efficient transfer of knowledge from seen to unseen activities. To reduce confusion between unseen activities and unrelated background information in videos, we introduce a more efficient method of computing the background class embedding by dynamically adapting it as part of the end-to-end learning. Additionally, unlike existing work on ZSTAD, we do not assume knowledge of which classes are unseen during training and use only the visual and semantic information of the seen classes for knowledge transfer. This makes TranZAD more viable for practical scenarios, which we demonstrate through extensive experiments on Thumos'14 and Charades.
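To make two of the abstract's ideas concrete, the sketch below illustrates, in a heavily simplified form, (a) contrastive training of visual segment embeddings against the semantic embeddings of the seen classes and (b) a background class embedding that is learned end-to-end rather than hand-crafted. This is our own minimal illustration, not the paper's implementation: the module name SemanticContrastiveHead, the linear projection, the InfoNCE-style cosine-similarity loss, and all dimensions and the temperature value are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticContrastiveHead(nn.Module):
    """Hypothetical sketch of a semantics-guided contrastive objective:
    segment embeddings from a transformer decoder are pulled toward the
    semantic (e.g., word-vector) embedding of their ground-truth seen class
    and pushed away from the other seen classes. A learnable background
    embedding is appended to the class bank so that background segments are
    handled inside the same end-to-end objective."""

    def __init__(self, visual_dim, semantic_dim, seen_class_embeddings, temperature=0.07):
        super().__init__()
        # Project visual segment features into the semantic embedding space.
        self.proj = nn.Linear(visual_dim, semantic_dim)
        # Fixed semantic embeddings of the seen classes, shape (C_seen, semantic_dim).
        self.register_buffer("class_emb", seen_class_embeddings)
        # Background embedding learned jointly with the rest of the model
        # (dynamically adapted during training, not hand-crafted).
        self.bg_emb = nn.Parameter(torch.randn(1, semantic_dim) * 0.02)
        self.temperature = temperature

    def forward(self, segment_feats, labels):
        """segment_feats: (N, visual_dim) features of N candidate segments.
        labels: (N,) indices in [0, C_seen], where index C_seen is background."""
        z = F.normalize(self.proj(segment_feats), dim=-1)        # (N, D)
        bank = torch.cat([self.class_emb, self.bg_emb], dim=0)   # (C_seen + 1, D)
        bank = F.normalize(bank, dim=-1)
        logits = z @ bank.t() / self.temperature                 # cosine similarities
        return F.cross_entropy(logits, labels)                   # InfoNCE-style loss


# Toy usage; all sizes are illustrative assumptions.
if __name__ == "__main__":
    torch.manual_seed(0)
    head = SemanticContrastiveHead(visual_dim=256, semantic_dim=300,
                                   seen_class_embeddings=torch.randn(20, 300))
    feats = torch.randn(8, 256)           # 8 candidate segments from the decoder
    labels = torch.randint(0, 21, (8,))   # class index 20 = background
    loss = head(feats, labels)
    loss.backward()
    print(float(loss))
```

At test time, one would simply replace the seen-class bank with the semantic embeddings of the unseen classes and classify each segment by its nearest embedding, which is the standard zero-shot transfer route the abstract alludes to.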

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Nag_2023_WACV,
    author    = {Nag, Sayak and Goldstein, Orpaz and Roy-Chowdhury, Amit K.},
    title     = {Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {6243-6253}
}