Multi-Modal Multi-Action Video Recognition

Zhensheng Shi, Ju Liang, Qianqian Li, Haiyong Zheng, Zhaorui Gu, Junyu Dong, Bing Zheng; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 13678-13687


Multi-action video recognition is much more challenging due to the requirement to recognize multiple actions co-occurring simultaneously or sequentially. Modeling multi-action relations is beneficial and crucial to understand videos with multiple actions, and actions in a video are usually presented in multiple modalities. In this paper, we propose a novel multi-action relation model for videos, by leveraging both relational graph convolutional networks (GCNs) and video multi-modality. We first build multi-modal GCNs to explore modality-aware multi-action relations, fed by modality-specific action representation as node features, e.g., spatiotemporal features learned by 3D convolutional neural network (CNN), audio and textual embeddings queried from respective feature lexicons. We then joint both multi-modal CNN-GCN models and multi-modal feature representations for learning better relational action predictions. Ablation study, multi-action relation visualization, and boosts analysis, all show efficacy of our multi-modal multi-action relation modeling. Also our method achieves state-of-the-art performance on large-scale multi-action M-MiT benchmark. Our code is made publicly available at

Related Material

[pdf] [supp]
@InProceedings{Shi_2021_ICCV, author = {Shi, Zhensheng and Liang, Ju and Li, Qianqian and Zheng, Haiyong and Gu, Zhaorui and Dong, Junyu and Zheng, Bing}, title = {Multi-Modal Multi-Action Video Recognition}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2021}, pages = {13678-13687} }