SkeleTR: Towards Skeleton-based Action Recognition in the Wild

Haodong Duan, Mingze Xu, Bing Shuai, Davide Modolo, Zhuowen Tu, Joseph Tighe, Alessandro Bergamo; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 13634-13644

Abstract


We present SkeleTR, a new framework for skeleton-based action recognition. In contrast to prior work, which focuses mainly on controlled environments, we target in-the-wild scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR follows a two-stage paradigm. It first models the intra-person skeleton dynamics of each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture the person interactions that are important for action recognition in the wild. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvements. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance.
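
To make the two-stage design described above more concrete, the following is a minimal sketch in PyTorch. It is not the authors' released implementation: the module and parameter names (SimpleGCNEncoder, SkeleTRSketch, embed_dim, num_joints, etc.) are hypothetical, and the graph convolution is a simplified stand-in. Stage 1 encodes each short skeleton sequence into one embedding; Stage 2 passes the variable-length set of sequence embeddings through stacked Transformer encoder layers to model person interactions, followed by a video-level classification head.

# Minimal sketch of the two-stage design described in the abstract.
# NOT the authors' implementation; all names and hyperparameters here
# (SimpleGCNEncoder, embed_dim, num_joints, ...) are illustrative assumptions.
import torch
import torch.nn as nn


class SimpleGCNEncoder(nn.Module):
    """Stage 1: encode one short skeleton sequence with (simplified) graph convolutions."""

    def __init__(self, num_joints=17, in_channels=3, embed_dim=256):
        super().__init__()
        # Learnable joint adjacency, a stand-in for a skeleton-graph prior.
        self.adj = nn.Parameter(torch.eye(num_joints))
        self.proj = nn.Linear(in_channels, embed_dim)
        self.gcn = nn.Linear(embed_dim, embed_dim)

    def forward(self, x):
        # x: (N, T, V, C) = sequences, frames, joints, coordinate channels
        h = self.proj(x)                                            # (N, T, V, D)
        h = torch.einsum("vw,ntwd->ntvd", self.adj.softmax(-1), h)  # mix joints
        h = torch.relu(self.gcn(h))
        return h.mean(dim=(1, 2))                                   # (N, D): one embedding per sequence


class SkeleTRSketch(nn.Module):
    """Stage 2: stacked Transformer encoders over all sequence embeddings in a clip."""

    def __init__(self, num_classes=60, embed_dim=256, num_layers=4, num_heads=8):
        super().__init__()
        self.intra = SimpleGCNEncoder(embed_dim=embed_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.inter = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, clips, padding_mask=None):
        # clips: (B, N, T, V, C) -- B videos, each with up to N short skeleton sequences
        B, N = clips.shape[:2]
        tokens = self.intra(clips.flatten(0, 1)).view(B, N, -1)     # (B, N, D)
        tokens = self.inter(tokens, src_key_padding_mask=padding_mask)
        return self.head(tokens.mean(dim=1))                        # video-level class logits


# Usage: 2 videos, each with 5 sequences of 16 frames, 17 joints, (x, y, confidence).
model = SkeleTRSketch()
logits = model(torch.randn(2, 5, 16, 17, 3))
print(logits.shape)  # torch.Size([2, 60])

With this structure, instance-level and group-level tasks differ mainly in how the per-sequence tokens are pooled (per-token heads versus clip-level pooling), which is one way to read the "unified solution" claim in the abstract.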

Related Material


[pdf]
[bibtex]
@InProceedings{Duan_2023_ICCV,
    author    = {Duan, Haodong and Xu, Mingze and Shuai, Bing and Modolo, Davide and Tu, Zhuowen and Tighe, Joseph and Bergamo, Alessandro},
    title     = {SkeleTR: Towards Skeleton-based Action Recognition in the Wild},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13634-13644}
}