AniMo: Species-Aware Model for Text-Driven Animal Motion Generation

Xuan Wang, Kai Ruan, Xing Zhang, Gaoang Wang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 1929-1939

Abstract


Text-driven motion generation has made significant strides in recent years. However, most existing works focus on human motion, largely overlooking the rich and diverse behaviors of animals. Understanding and synthesizing animal motion have important applications in wildlife conservation, animal ecology, and biomechanics. Animal motion modeling presents unique challenges due to species diversity, varied morphological structures, and different behavioral patterns in response to similar textual descriptions. To address these challenges, we propose AniMo for text-driven animal motion generation. AniMo consists of two stages: motion tokenization and text-to-motion generation. In the motion tokenization stage, we encode motions using a joint-aware spatiotemporal encoder with species-aware feature modulation, enabling the model to adapt to diverse skeletal structures across species. In the text-to-motion generation stage, we employ masked modeling to jointly learn the mappings between textual descriptions and motion tokens. Additionally, we introduce AniMo4D, a large-scale dataset containing 78,149 motion sequences and 185,435 textual descriptions across 114 animal species. Experimental results show that AniMo achieves superior performance on both the AniMo4D and AnimalML3D datasets, effectively capturing diverse morphological structures and behavioral patterns across animal species.
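As a rough illustration of two ideas named in the abstract, the sketch below shows a generic scale-and-shift feature modulation conditioned on species, and random masking of a motion token sequence as used in masked-modeling training. The abstract does not specify the actual architecture, so everything here is an assumption for illustration: the names `modulate`, `mask_tokens`, `SPECIES_PARAMS`, and `MASK_ID`, the per-species values, and the choice of a scale-and-shift form are hypothetical and not taken from the paper.

```python
import random

MASK_ID = -1  # hypothetical id reserved for masked positions


def modulate(features, scale, shift):
    """Scale and shift each feature channel (a generic modulation form)."""
    return [g * x + b for x, g, b in zip(features, scale, shift)]


def mask_tokens(tokens, ratio, rng):
    """Replace a random subset of token ids with MASK_ID.

    Returns the masked copy and the sorted list of masked positions.
    """
    n_mask = max(1, round(ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


# Hypothetical per-species parameters; a real model would learn these.
SPECIES_PARAMS = {
    "wolf": {"scale": [1.0, 0.5, 2.0], "shift": [0.0, 0.1, -0.2]},
    "eagle": {"scale": [0.8, 1.5, 1.0], "shift": [0.3, 0.0, 0.0]},
}

# Modulate one joint's feature vector with the parameters for its species.
joint_features = [0.2, -0.4, 1.0]
params = SPECIES_PARAMS["wolf"]
modulated = modulate(joint_features, params["scale"], params["shift"])

# Mask half of a made-up motion token sequence; a generator would be
# trained to predict the original ids at the masked positions from the
# remaining tokens and the text description.
motion_tokens = [17, 4, 92, 8, 33, 51, 4, 60]
masked, positions = mask_tokens(motion_tokens, 0.5, random.Random(0))
```

The same input features yield different outputs depending on which species entry is selected, which is the general sense in which a shared encoder can adapt to different skeletons.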

Related Material


@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Xuan and Ruan, Kai and Zhang, Xing and Wang, Gaoang},
    title     = {AniMo: Species-Aware Model for Text-Driven Animal Motion Generation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {1929-1939}
}