LiveGesture: Streamable Co-Speech Gesture Generation Model

Saleem, Muhammad Usama; Patel, Mayur Jagdishbhai; Pinyoanuntapong, Ekkasit; Qin, Zhongxing; Yang, Li; Xue, Hongfei; Helmy, Ahmed; Chen, Chen; Wang, Pu

Muhammad Usama Saleem, Mayur Jagdishbhai Patel, Ekkasit Pinyoanuntapong, Zhongxing Qin, Li Yang, Hongfei Xue, Ahmed Helmy, Chen Chen, Pu Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 2264-2273

Abstract

We propose LiveGesture, the first fully streamable, speech-driven full-body gesture generation framework that operates with zero look-ahead and supports arbitrary sequence lengths. Unlike existing co-speech gesture methods, which are designed for offline generation and either treat body regions independently or entangle all joints within a single model, LiveGesture is built from the ground up for causal, region-coordinated motion generation. LiveGesture consists of two main modules: the Streamable Vector-Quantized Motion Tokenizer (SVQ) and the Hierarchical Autoregressive Transformer (HAR). The SVQ tokenizer converts the motion sequence of each body region into causal, discrete motion tokens, enabling real-time, streamable token decoding. On top of SVQ, HAR employs region-eXpert autoregressive (xAR) transformers to model expressive, fine-grained motion dynamics for each body region. A causal spatio-temporal fusion module (xAR-Fusion) then captures and integrates correlated motion dynamics across regions. Both xAR and xAR-Fusion are conditioned on live, continuously arriving audio signals encoded by a streamable causal audio encoder. To enhance robustness under streaming noise and prediction errors, we introduce autoregressive masking training, which leverages uncertainty-guided token masking and random region masking to expose the model to imperfect, partially erroneous histories during training. Experiments on the BEAT2 dataset demonstrate that LiveGesture produces coherent, diverse, and beat-synchronous full-body gestures in real time, matching or surpassing state-of-the-art offline methods under true zero-look-ahead conditions.
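The abstract names two corruption strategies used in autoregressive masking training but does not give their details. The following is a minimal illustrative sketch, not the authors' implementation. It assumes that each motion token carries a scalar confidence score, that the least confident tokens in a region's history are the ones replaced by a mask token, and that an entire region is occasionally masked at random. The function name `mask_history`, the `MASK` id, the region names, and the parameters `mask_ratio` and `region_drop_prob` are all hypothetical.

```python
import random

MASK = -1  # hypothetical id standing in for a learned mask token


def mask_history(tokens, confidence, mask_ratio=0.3, region_drop_prob=0.2, rng=None):
    """Corrupt a per-region token history before it is fed back to the model.

    tokens:     dict mapping region name -> list of discrete motion token ids
    confidence: dict mapping region name -> list of floats, same lengths as tokens
                (assumed here to come from the model's own predictions)
    Returns a new dict; the inputs are left unmodified.
    """
    rng = rng or random.Random()
    masked = {}
    for region, seq in tokens.items():
        # Random region masking: hide the whole history of this region.
        if rng.random() < region_drop_prob:
            masked[region] = [MASK] * len(seq)
            continue
        # Uncertainty-guided token masking: hide the k least confident tokens.
        k = int(len(seq) * mask_ratio)
        by_confidence = sorted(range(len(seq)), key=lambda i: confidence[region][i])
        least_confident = set(by_confidence[:k])
        masked[region] = [MASK if i in least_confident else t for i, t in enumerate(seq)]
    return masked


# Example with two hypothetical regions and four past tokens each.
history = {"hands": [5, 9, 2, 7], "upper": [1, 3, 4, 8]}
scores = {"hands": [0.9, 0.1, 0.8, 0.2], "upper": [0.5, 0.6, 0.7, 0.4]}
corrupted = mask_history(history, scores, mask_ratio=0.5, region_drop_prob=0.0)
```

With `region_drop_prob=0.0` only the low-confidence tokens are hidden; with `region_drop_prob=1.0` every region is fully masked. Training on such corrupted histories is one plausible way to expose a model to the imperfect context it would see when consuming its own streamed predictions.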

Related Material

@InProceedings{Saleem_2026_CVPR,
    author    = {Saleem, Muhammad Usama and Patel, Mayur Jagdishbhai and Pinyoanuntapong, Ekkasit and Qin, Zhongxing and Yang, Li and Xue, Hongfei and Helmy, Ahmed and Chen, Chen and Wang, Pu},
    title     = {LiveGesture: Streamable Co-Speech Gesture Generation Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {2264-2273}
}