Zero-Shot Action Recognition With Transformer-Based Video Semantic Embedding

Keval Doshi, Yasin Yilmaz; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 4859-4868

Abstract


While video action recognition has been an active area of research for several years, zero-shot action recognition has only recently started gaining traction. In this work, we propose a novel end-to-end trained transformer model that captures long-range spatiotemporal dependencies efficiently, in contrast to existing approaches that use 3D-CNNs. Moreover, to address a common ambiguity in existing works about which classes can be considered previously unseen, we propose a new experimental setup that satisfies the zero-shot learning premise for action recognition by avoiding any overlap between the training and testing classes. The proposed approach significantly outperforms the state of the art in zero-shot action recognition in terms of top-1 accuracy on the UCF-101, HMDB-51, and ActivityNet datasets.
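The zero-shot premise described in the abstract can be made concrete with a minimal sketch (not the authors' released code): a video encoder maps each clip into a semantic embedding space, and an unseen action is predicted by nearest-neighbor search over the embeddings of the unseen class labels, after first verifying that the training and testing class sets are disjoint. All function names, dimensions, and the random stand-in embeddings below are illustrative assumptions.

import numpy as np

def assert_zero_shot_split(train_classes, test_classes):
    """Enforce the zero-shot premise: no overlap between seen and unseen classes."""
    overlap = set(train_classes) & set(test_classes)
    assert not overlap, f"Train/test class overlap violates zero-shot premise: {overlap}"

def predict_unseen_class(video_embedding, class_embeddings):
    """Assign the unseen class whose semantic embedding has highest cosine similarity."""
    names = list(class_embeddings)
    mat = np.stack([class_embeddings[n] for n in names])        # (C, D)
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)      # unit-normalize classes
    v = video_embedding / np.linalg.norm(video_embedding)       # unit-normalize video
    return names[int(np.argmax(mat @ v))]                       # nearest class by cosine

# Toy usage: random 300-d vectors stand in for real label embeddings.
rng = np.random.default_rng(0)
train_classes = ["archery", "bowling"]
test_classes = ["surfing", "fencing"]                           # unseen at training time
assert_zero_shot_split(train_classes, test_classes)

class_embeddings = {c: rng.normal(size=300) for c in test_classes}
video_embedding = class_embeddings["surfing"] + 0.1 * rng.normal(size=300)
print(predict_unseen_class(video_embedding, class_embeddings))  # -> "surfing"

In practice, the class embeddings would come from a pretrained language model and the video embedding from the proposed transformer; the toy vectors here only demonstrate the matching logic and the disjoint-split check.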

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Doshi_2023_CVPR,
    author    = {Doshi, Keval and Yilmaz, Yasin},
    title     = {Zero-Shot Action Recognition With Transformer-Based Video Semantic Embedding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2023},
    pages     = {4859-4868}
}