Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification

Bo Wang, Kaili Zhao, Hongyang Zhao, Shi Pu, Bo Xiao, Jun Guo; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 315-324

Abstract


Video attributes, which leverage video content to instantiate class semantics, play a critical role in diversifying semantics in zero-shot video classification, thereby facilitating semantic transfer from seen to unseen classes. However, few works discuss video attributes, and most methods use class names as class semantics, which tend to be loosely defined. In this paper, we propose a Video Attribute Prototype Network (VAPNet) that generates video attributes by learning in-context semantics between video captions and class semantics. Specifically, we introduce a cross-attention module into the Transformer decoder, treating video captions as queries that probe and pool semantically associated class-wise features. To alleviate noise in pre-extracted captions, we learn caption features through a stochastic representation drawn from a Gaussian distribution whose variance encodes uncertainty. We employ a joint video-to-attribute and video-to-video contrastive loss to calibrate visual and semantic features. Experiments show that VAPNet significantly outperforms the state of the art, with relative improvements of 14.3% on UCF101 and 8.8% on HMDB51, and further surpasses the pre-trained vision-language SoTA by 4.1% and 17.2%, respectively. Code is available.
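The two mechanisms the abstract names — captions acting as cross-attention queries over class-wise features, and a stochastic Gaussian caption representation whose variance encodes uncertainty — can be sketched as follows. This is a minimal single-head illustration in plain Python; the function names, dimensions, and toy inputs are assumptions for exposition, not the authors' exact architecture.

```python
# Hedged sketch, not VAPNet's actual implementation:
# (1) cross-attention pooling with a caption feature as the query, and
# (2) a reparameterized Gaussian sample of the caption representation.
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_attention(query, keys, values):
    """Pool class-wise `values`, weighted by similarity of `query` to `keys`.

    Single-head scaled dot-product attention: the caption feature is the
    query; class-wise semantic features serve as keys and values.
    """
    d = len(query)
    scores = [dot(query, k) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    pooled = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return pooled, weights

def sample_gaussian(mu, log_var, rng=random):
    """Reparameterized sample z = mu + sigma * eps, with sigma = exp(log_var / 2).

    The variance term is where caption uncertainty would be encoded.
    """
    return [m + math.exp(lv / 2) * rng.gauss(0, 1)
            for m, lv in zip(mu, log_var)]

# Toy usage: one caption query attending over three class-wise features.
caption_q = [0.2, 0.9, -0.1, 0.4]
class_feats = [[1.0, 0.0, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 0.0]]
attr, weights = cross_attention(caption_q, class_feats, class_feats)
z = sample_gaussian(mu=attr, log_var=[-2.0] * 4)  # low-variance sample
```

The pooled vector `attr` plays the role of a semantics-aware attribute feature; in the paper this would then enter the contrastive calibration losses.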

Related Material


[bibtex]
@InProceedings{Wang_2023_ICCV,
    author    = {Wang, Bo and Zhao, Kaili and Zhao, Hongyang and Pu, Shi and Xiao, Bo and Guo, Jun},
    title     = {Video Attribute Prototype Network: A New Perspective for Zero-Shot Video Classification},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {315-324}
}