EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Chen, Kai; Gou, Yunhao; Huang, Runhui; Liu, Zhili; Tan, Daxin; Xu, Jing; Wang, Chunwei; Zhu, Yi; Zeng, Yihan; Yang, Kuo; Wang, Dingdong; Xiang, Kun; Li, Haoyuan; Bai, Haoli; Han, Jianhua; Li, Xiaohui; Jin, Weike; Xie, Nian; Zhang, Yu; Kwok, James T.; Zhao, Hengshuang; Liang, Xiaodan; Yeung, Dit-Yan; Chen, Xiao; Li, Zhenguo; Zhang, Wei; Liu, Qun; Hong, Lanqing; Hou, Lu; Xu, Hang

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 5455-5466

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.

Related Material


@InProceedings{Chen_2025_CVPR,
    author    = {Chen, Kai and Gou, Yunhao and Huang, Runhui and Liu, Zhili and Tan, Daxin and Xu, Jing and Wang, Chunwei and Zhu, Yi and Zeng, Yihan and Yang, Kuo and Wang, Dingdong and Xiang, Kun and Li, Haoyuan and Bai, Haoli and Han, Jianhua and Li, Xiaohui and Jin, Weike and Xie, Nian and Zhang, Yu and Kwok, James T. and Zhao, Hengshuang and Liang, Xiaodan and Yeung, Dit-Yan and Chen, Xiao and Li, Zhenguo and Zhang, Wei and Liu, Qun and Hong, Lanqing and Hou, Lu and Xu, Hang},
    title     = {EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {5455-5466}
}