Recurring the Transformer for Video Action Recognition
Existing video understanding approaches, such as 3D convolutional neural networks and Transformer-based methods, usually process videos clip by clip. As a result, they need a large amount of GPU memory and usually require fixed-length video clips. We introduce a novel Recurrent Vision Transformer (RViT) framework for spatial-temporal representation learning for video action recognition. Specifically, RViT is equipped with an attention gate that builds an interaction between the current frame input and the previous hidden state, thereby aggregating global inter-frame features through the hidden state. RViT is executed recurrently: at each step it takes the current frame and the previous hidden state as input. The attention gate and the recurrent execution together allow RViT to capture both spatial and temporal features. In addition, because it processes a video frame by frame, RViT works on both fixed-length and variable-length video clips without requiring large GPU memory. Our experimental results show that RViT achieves state-of-the-art performance on several video recognition datasets. Specifically, RViT achieves a top-1 accuracy of 81.5% on Kinetics-400, 92.31% on Jester, and 67.9% on Something-Something-V2, and an mAP of 66.1% on Charades.
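The frame-by-frame flow described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the abstract does not specify the gate, so the cross-attention from frame tokens to the hidden state, the sigmoid blend, and the names `make_params`, `attention_gate`, and `encode_clip` are all assumptions made for illustration. The sketch only shows why a recurrent design keeps one frame and one hidden state in memory at a time and accepts clips of any length.

```python
import numpy as np


def softmax(a, axis=-1):
    # Numerically stable softmax.
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)


def make_params(dim, seed=0):
    # Hypothetical projection matrices (query, key, value, gate).
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((dim, dim)) * 0.1 for _ in range(4)]


def attention_gate(x, h, params):
    """Assumed gate: current-frame tokens x attend to the hidden state h.

    x, h: arrays of shape (num_tokens, dim). Returns the new hidden state.
    """
    w_q, w_k, w_v, w_z = params
    q, k, v = x @ w_q, h @ w_k, h @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (tokens, tokens)
    context = attn @ v                               # inter-frame information
    z = 1.0 / (1.0 + np.exp(-((x + context) @ w_z)))  # update gate in (0, 1)
    candidate = np.tanh(x + context)
    return (1.0 - z) * h + z * candidate


def encode_clip(frames, params):
    """Process frames one at a time; only one frame and h are held at once."""
    h = np.zeros_like(frames[0])
    for x in frames:
        h = attention_gate(x, h, params)
    return h.mean(axis=0)  # pooled clip-level feature, shape (dim,)
```

Calling `encode_clip` on a list of 3 frames or of 10 frames returns a feature of the same shape, since the clip length only changes the number of recurrent steps, not the size of any tensor.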