HaViT: Hybrid-attention based Vision Transformer for Video Classification

Li Li, Liansheng Zhuang, Shenghua Gao, Shafei Wang; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 4243-4259

Abstract


Video transformers have become a promising tool for video classification due to their great success in modeling long-range interactions through the self-attention operation. However, existing transformer models exploit only the patch dependencies within a video when computing self-attention, while ignoring the patch dependencies across different videos. This paper argues that external patch prior information is beneficial to the performance of video transformer models for video classification. Motivated by this observation, this paper proposes a novel Hybrid-attention based Vision Transformer (HaViT) model for video classification, which explicitly exploits both internal patch dependencies within a video and external patch dependencies across videos. Different from existing self-attention, the hybrid-attention is computed over the internal patch tokens together with an external patch token dictionary, which encodes patch prior information shared across different videos. Experiments on Kinetics-400, Kinetics-600, and Something-Something-v2 show that our HaViT model achieves state-of-the-art performance on video classification compared with existing methods. Moreover, experiments show that our proposed hybrid-attention scheme can be integrated into existing video transformer models to improve their performance.
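To make the hybrid-attention idea concrete, the sketch below shows one plausible instantiation, assuming the external dictionary is simply concatenated with the internal patch tokens on the key/value side while queries come from the current video only. This is a minimal illustration of the concept described in the abstract, not the authors' exact formulation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_attention(internal_tokens, external_dict, Wq, Wk, Wv):
    """Single-head hybrid attention (hypothetical sketch).

    internal_tokens: (N, d) patch tokens from the current video
    external_dict:   (M, d) learned token dictionary encoding
                     cross-video patch prior information
    Wq, Wk, Wv:      (d, d) projection matrices
    """
    # Queries are formed from the current video's patch tokens only.
    q = internal_tokens @ Wq
    # Keys and values range over both internal tokens and the
    # external dictionary, so attention mixes internal and
    # external (cross-video) dependencies.
    kv_source = np.concatenate([internal_tokens, external_dict], axis=0)
    k = kv_source @ Wk
    v = kv_source @ Wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return attn @ v  # (N, d) hybrid-attended patch tokens

# Toy usage with random tokens and projections.
rng = np.random.default_rng(0)
d, N, M = 64, 16, 32
out = hybrid_attention(rng.standard_normal((N, d)),
                       rng.standard_normal((M, d)),
                       rng.standard_normal((d, d)),
                       rng.standard_normal((d, d)),
                       rng.standard_normal((d, d)))
print(out.shape)  # (16, 64)
```

In a full model, the external dictionary would be learned (or aggregated from many videos) and shared across samples, whereas the internal tokens change per input clip.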

Related Material


[pdf] [code]
[bibtex]
@InProceedings{Li_2022_ACCV,
    author    = {Li, Li and Zhuang, Liansheng and Gao, Shenghua and Wang, Shafei},
    title     = {HaViT: Hybrid-attention based Vision Transformer for Video Classification},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {4243-4259}
}