Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition

Xiaolin Zhai, Zhengxi Hu, Dingye Yang, Lei Zhou, Jingtai Liu; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 20-38


Group activity recognition aims to infer the activity of a group in multi-person scenes. Previous methods usually model inter-person relations and integrate individuals' features into a group representation. However, they neglect the intra-person relations contained in the human skeleton: individual representations can also be inferred by analyzing the evolution of human skeletons over time. In this paper, we utilize RGB images and human skeletons as inputs, which carry complementary information. Considering the different semantic attributes of the two inputs, we design a dedicated branch for each. For RGB images, we propose a Scene Encoded Transformer, a Spatial Transformer, and a Temporal Transformer to explore inter-person spatial and temporal relations. For skeleton inputs, we capture intra-person spatial and temporal dynamics by designing a Spatial GCN and a Temporal GCN. Our main contributions are: i) we propose a spatial-temporal network with two branches for group activity recognition, utilizing RGB images and human skeletons; experiments show that our model achieves 97.1 MCA and 96.1 MPCA on the Collective Activity dataset and 94.0 MCA and 94.4 MPCA on the Volleyball dataset. ii) we extend the two datasets by introducing human skeleton annotations, namely human joint coordinates and confidences, which can also be used for the action recognition task. The code is available at
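To make the skeleton branch concrete, a single graph-convolution step over human joints can be sketched as follows. This is a minimal NumPy illustration of the generic GCN update ReLU(D^{-1/2}(A+I)D^{-1/2} H W) on a toy joint graph with random weights, not the paper's actual Spatial/Temporal GCN architecture:

```python
import numpy as np

def normalized_adjacency(A):
    """Symmetrically normalize an adjacency matrix with self-loops:
    A_norm = D^{-1/2} (A + I) D^{-1/2}."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return D_inv_sqrt @ A_hat @ D_inv_sqrt

def gcn_layer(H, A_norm, W):
    """One graph-convolution layer: ReLU(A_norm @ H @ W)."""
    return np.maximum(0.0, A_norm @ H @ W)

# Toy skeleton graph: 4 joints in a chain (e.g. hip-knee-ankle-toe).
# Each joint carries a 3-d input feature (x, y, confidence), matching
# the kind of annotation the paper adds to the datasets.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 3))   # per-joint input features
W = rng.standard_normal((3, 8))   # projection weights (random stand-in)

A_norm = normalized_adjacency(A)
out = gcn_layer(H, A_norm, W)     # (4 joints, 8 features)
person_repr = out.mean(axis=0)    # pool joints -> one individual embedding
```

Stacking such layers spatially (over the joint graph) and temporally (over frames) gives per-person features that a fusion stage can combine with the image-branch transformer outputs.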

Related Material

@InProceedings{Zhai_2022_ACCV,
  author    = {Zhai, Xiaolin and Hu, Zhengxi and Yang, Dingye and Zhou, Lei and Liu, Jingtai},
  title     = {Spatial Temporal Network for Image and Skeleton Based Group Activity Recognition},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2022},
  pages     = {20-38}
}