Continual Learning for Personalized Co-speech Gesture Generation

Chaitanya Ahuja, Pratik Joshi, Ryo Ishii, Louis-Philippe Morency; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 20893-20903


Co-speech gestures are a key channel of human communication, making them important for personalized chat agents to generate. In the past, gesture generation models assumed that data for each speaker is available all at once and in large amounts. However, in practical scenarios, speaker data arrives sequentially and in small amounts as the agent personalizes to more speakers, akin to a continual learning paradigm. While more recent works have shown progress in adapting to low-resource data, they catastrophically forget the gesture styles of the initial speakers they were trained on. In addition, prior generative continual learning works are not multimodal, leaving this space less studied. In this paper, we explore this new paradigm and propose C-DiffGAN: an approach that continually learns new speaker gesture styles with only a few minutes of per-speaker data, while retaining previously learnt styles. Inspired by prior continual learning works, C-DiffGAN encourages knowledge retention by 1) generating reminiscences of previous low-resource speaker data, then 2) crossmodally aligning to them to mitigate catastrophic forgetting. We quantitatively demonstrate improved performance and reduced forgetting over strong baselines through standard continual learning measures, reinforced by a qualitative user study showing that our method produces more natural, style-preserving gestures. Code and videos can be found at
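The first retention step named in the abstract, generating reminiscences of earlier speakers, resembles generic pseudo-rehearsal (generative replay). The block below is a minimal sketch of that general idea only, not the C-DiffGAN implementation: the function `build_replay_batch`, the `replay_ratio` parameter, and the toy `frozen_generator` are all hypothetical names introduced for illustration, and the crossmodal alignment step is not modeled.

```python
import random

def build_replay_batch(new_pairs, old_generator, old_audio, replay_ratio=0.5, seed=0):
    """Mix real new-speaker (audio, gesture) pairs with replayed pairs.

    Replayed pairs are produced by a frozen copy of the previously trained
    generator, standing in for old-speaker data that is no longer available.
    All names here are assumptions made for this sketch.
    """
    rng = random.Random(seed)
    n_replay = int(len(new_pairs) * replay_ratio)
    replayed = []
    for _ in range(n_replay):
        audio = rng.choice(old_audio)                   # audio context of an earlier speaker
        replayed.append((audio, old_generator(audio)))  # generated "reminiscence"
    batch = list(new_pairs) + replayed
    rng.shuffle(batch)                                  # interleave new and replayed samples
    return batch

# Toy stand-ins: audio and gesture are short lists of floats.
def frozen_generator(audio):
    # Hypothetical frozen generator: maps audio features to a gesture vector.
    return [2.0 * x for x in audio]

new_pairs = [
    ([0.1, 0.2], [1.0, 1.1]),
    ([0.3, 0.4], [1.2, 1.3]),
    ([0.7, 0.8], [1.4, 1.5]),
    ([0.9, 1.0], [1.6, 1.7]),
]
old_audio = [[0.5, 0.6], [0.25, 0.35]]

batch = build_replay_batch(new_pairs, frozen_generator, old_audio, replay_ratio=0.5)
```

In this sketch, training on `batch` instead of `new_pairs` alone is what would expose the model to earlier styles while it adapts to the new speaker; how the replayed samples are weighted or aligned in the actual method is not specified here.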

Related Material

@InProceedings{Ahuja_2023_ICCV,
    author    = {Ahuja, Chaitanya and Joshi, Pratik and Ishii, Ryo and Morency, Louis-Philippe},
    title     = {Continual Learning for Personalized Co-speech Gesture Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {20893-20903}
}