MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions

Yunfei Liu, Lijian Lin, Fei Yu, Changyin Zhou, Yu Li; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 23020-23029

Abstract


Audio-driven portrait animation aims to synthesize portrait videos conditioned on a given audio signal. Animating high-fidelity and multimodal video portraits has a wide range of applications. Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training separate models or by sampling signals from given videos. However, the lack of correlation learning between lip synchronization and other movements (e.g., head pose and eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method consists of three stages: 1) the Mapping-Once network with Dual Attentions (MODA) generates talking representations from the given audio; in MODA, we design a dual-attention module to encode accurate mouth movements and diverse other modalities; 2) a facial composer network generates dense and detailed face landmarks; and 3) a temporal-guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits than previous methods.
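
To make the three-stage structure concrete, the following is a minimal, hypothetical PyTorch sketch of the inference flow described above. All class names, feature dimensions, and module internals (linear layers, multi-head attention branches, the frame decoder) are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Illustrative sketch of the three-stage pipeline from the abstract.
# Every name, dimension, and layer choice here is an assumption.
import torch
import torch.nn as nn


class MODANet(nn.Module):
    """Stage 1 (assumed): maps an audio feature sequence to talking
    representations, with one attention branch for accurate mouth
    movements and one for diverse other modalities (pose, blinking)."""

    def __init__(self, audio_dim=80, hidden=256, repr_dim=64, heads=4):
        super().__init__()
        self.encode = nn.Linear(audio_dim, hidden)
        self.mouth_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.other_attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.to_repr = nn.Linear(2 * hidden, repr_dim)

    def forward(self, audio_feats):                    # (B, T, audio_dim)
        h = self.encode(audio_feats)
        mouth, _ = self.mouth_attn(h, h, h)
        other, _ = self.other_attn(h, h, h)
        return self.to_repr(torch.cat([mouth, other], dim=-1))


class FacialComposer(nn.Module):
    """Stage 2 (assumed): expands the compact talking representation
    into dense, detailed 2D face landmarks per frame."""

    def __init__(self, repr_dim=64, n_landmarks=478):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(repr_dim, 512), nn.ReLU(),
            nn.Linear(512, n_landmarks * 2),
        )
        self.n_landmarks = n_landmarks

    def forward(self, reprs):                          # (B, T, repr_dim)
        out = self.mlp(reprs)
        return out.view(*reprs.shape[:2], self.n_landmarks, 2)


class TemporalGuidedRenderer(nn.Module):
    """Stage 3 (assumed): renders frames from landmark sequences,
    conditioning each frame on its temporal neighbours for stability."""

    def __init__(self, n_landmarks=478, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.decode = nn.Linear(n_landmarks * 2 * 3, 3 * img_size * img_size)

    def forward(self, landmarks):                      # (B, T, N, 2)
        B, T, N, _ = landmarks.shape
        flat = landmarks.reshape(B, T, -1)
        # Stack previous / current / next landmark frames as temporal guidance.
        prev = torch.roll(flat, 1, dims=1)
        nxt = torch.roll(flat, -1, dims=1)
        ctx = torch.cat([prev, flat, nxt], dim=-1)
        frames = torch.sigmoid(self.decode(ctx))
        return frames.view(B, T, 3, self.img_size, self.img_size)


if __name__ == "__main__":
    audio = torch.randn(1, 100, 80)                    # dummy audio features
    reprs = MODANet()(audio)
    lmks = FacialComposer()(reprs)
    video = TemporalGuidedRenderer()(lmks)
    print(video.shape)                                 # torch.Size([1, 100, 3, 64, 64])
```

The point of the sketch is only the data flow: audio features are mapped once to a shared talking representation, which the composer expands into dense landmarks, which the renderer turns into temporally stable frames.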

Related Material


@InProceedings{Liu_2023_ICCV,
    author    = {Liu, Yunfei and Lin, Lijian and Yu, Fei and Zhou, Changyin and Li, Yu},
    title     = {MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {23020-23029}
}