Learning Video Representations of Human Motion From Synthetic Data

Xi Guo, Wei Wu, Dongliang Wang, Jing Su, Haisheng Su, Weihao Gan, Jian Huang, Qin Yang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20197-20207

Abstract


In this paper, we take an early step towards video representation learning of human actions with the help of large-scale synthetic videos, particularly for enhancing human motion representations. Specifically, we first introduce an automatic action-related video synthesis pipeline based on a photorealistic video game. Using this pipeline, we build a large-scale human action dataset named GATA (GTA Animation Transformed Actions), which includes 8.1 million action clips spanning over 28K action classes. Based on the presented dataset, we design a contrastive learning framework for human motion representation learning, which yields significant performance improvements on several typical action recognition benchmarks, e.g., Charades, HAA500 and NTU RGB+D. In addition, we explore a domain adaptation method based on cross-domain positive pair mining to alleviate the domain gap between synthetic and real data. Extensive analyses of the learned representations demonstrate the effectiveness of the proposed dataset for enhancing human motion representation learning.
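The abstract names two components that are easy to picture concretely: an InfoNCE-style contrastive objective over clip embeddings, and mining of cross-domain (synthetic-to-real) positive pairs. The PyTorch sketch below illustrates one plausible reading of both; the function names, the nearest-neighbor mining rule, and the temperature value are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, temperature=0.07):
    # Standard InfoNCE: row i of `positive` is the positive for row i
    # of `anchor`; every other row in the batch serves as a negative.
    anchor = F.normalize(anchor, dim=1)
    positive = F.normalize(positive, dim=1)
    logits = anchor @ positive.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, targets)

@torch.no_grad()
def mine_cross_domain_positives(real_emb, synth_emb):
    # One plausible mining rule (an assumption, not the paper's criterion):
    # pair each real clip with its nearest synthetic clip in embedding
    # space and treat that synthetic clip as a pseudo-positive.
    sims = F.normalize(real_emb, dim=1) @ F.normalize(synth_emb, dim=1).t()
    return synth_emb[sims.argmax(dim=1)]

# Toy usage with random embeddings standing in for video encoder outputs.
real = torch.randn(32, 128)    # embeddings of real video clips
synth = torch.randn(512, 128)  # embeddings of synthetic (GATA-like) clips
loss = info_nce_loss(real, mine_cross_domain_positives(real, synth))

In an actual training loop the mined synthetic clips would presumably be re-encoded with gradients enabled and combined with same-domain positives; see the paper for the precise objective.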

Related Material


@InProceedings{Guo_2022_CVPR,
  author    = {Guo, Xi and Wu, Wei and Wang, Dongliang and Su, Jing and Su, Haisheng and Gan, Weihao and Huang, Jian and Yang, Qin},
  title     = {Learning Video Representations of Human Motion From Synthetic Data},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2022},
  pages     = {20197-20207}
}