Low-Resource Adaptation for Personalized Co-Speech Gesture Generation

Chaitanya Ahuja, Dong Won Lee, Louis-Philippe Morency; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 20566-20576

Abstract


Personalizing an avatar for co-speech gesture generation from spoken language requires learning the idiosyncrasies of a person's gesture style from a small amount of data. Previous methods in gesture generation require large amounts of data for each speaker, which is often infeasible. We propose an approach, named DiffGAN, that efficiently personalizes the co-speech gesture generation model of a high-resource source speaker to a target speaker with just 2 minutes of target training data. A unique characteristic of DiffGAN is its ability to account for the crossmodal grounding shift, while also addressing the distribution shift in the output domain. We substantiate the effectiveness of our approach on a large-scale publicly available dataset through quantitative, qualitative, and user studies, which show that our proposed methodology significantly outperforms prior approaches for low-resource adaptation of gesture generation. Code and videos can be found at https://chahuja.com/diffgan
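The abstract does not describe DiffGAN's internals, so the sketch below only illustrates the general low-resource adaptation setting it addresses: starting from a generator pretrained on a high-resource source speaker and fine-tuning part of it on roughly 2 minutes of target-speaker data. The model class `GestureGenerator`, the helper `adapt_to_target`, the feature/pose dimensions, and the choice to freeze the speech encoder are all illustrative assumptions, not the authors' method.

# Illustrative only: generic low-resource fine-tuning of a pretrained
# speech-to-gesture generator on a small target-speaker set. Placeholder
# architecture and hyperparameters; this is not DiffGAN.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

class GestureGenerator(nn.Module):
    """Hypothetical generator: maps speech features to per-frame pose keypoints."""
    def __init__(self, speech_dim=80, pose_dim=2 * 52, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(speech_dim, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, pose_dim)

    def forward(self, speech):          # speech: (B, T, speech_dim)
        h, _ = self.encoder(speech)
        return self.decoder(h)          # poses: (B, T, pose_dim)

def adapt_to_target(model, target_speech, target_poses, steps=500, lr=1e-4):
    """Fine-tune only the decoder on a small (~2 min) target-speaker set,
    keeping the speech encoder frozen to reduce overfitting."""
    for p in model.encoder.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.decoder.parameters(), lr=lr)
    loader = DataLoader(TensorDataset(target_speech, target_poses),
                        batch_size=8, shuffle=True)
    loss_fn = nn.L1Loss()
    step = 0
    while step < steps:
        for speech, poses in loader:
            loss = loss_fn(model(speech), poses)
            opt.zero_grad()
            loss.backward()
            opt.step()
            step += 1
            if step >= steps:
                break
    return model

if __name__ == "__main__":
    # Toy stand-in for ~2 minutes of target-speaker clips (random tensors).
    speech = torch.randn(60, 50, 80)    # 60 clips, 50 frames, 80-dim speech features
    poses = torch.randn(60, 50, 104)    # matching 52 x/y keypoints per frame
    model = GestureGenerator()          # assume pretrained on the source speaker
    adapt_to_target(model, speech, poses)

Note that a plain fine-tuning baseline like this does not model the crossmodal grounding shift or the output distribution shift that the paper highlights; it is included only to make the adaptation setting concrete.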

Related Material


@InProceedings{Ahuja_2022_CVPR,
    author    = {Ahuja, Chaitanya and Lee, Dong Won and Morency, Louis-Philippe},
    title     = {Low-Resource Adaptation for Personalized Co-Speech Gesture Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {20566-20576}
}