Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding

Peiji Yang, Huawei Wei, Yicheng Zhong, Zhisheng Wang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21032-21041

Abstract


Existing speech-driven 3D facial animation methods typically follow a supervised paradigm, regressing directly from speech to 3D facial animation. This paradigm faces two major challenges: the high cost of acquiring supervision, and the ambiguity of the mapping between speech and lip movements. To address these challenges, this study proposes a novel cross-modal semi-supervised framework comprising a Speech-to-Image Transcoder and a Face-to-Geometry Regressor. The former jointly learns a common representation space from the speech and image domains, enabling the transformation of speech into semantically consistent facial images. The latter reconstructs 3D facial meshes from the transformed images. Training data for both modules can be acquired with minimal effort, obviating the dependence on costly supervised data. Furthermore, the joint learning scheme fuses intricate visual features into the speech encoding, so that subtle speech variations translate into nuanced lip movements, ultimately enhancing the fidelity of the reconstructed 3D faces. Consequently, the ambiguity of the direct speech-to-animation mapping is significantly reduced, yielding coherent, high-fidelity lip motion. Extensive experiments demonstrate that our approach produces results competitive with supervised methods.
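To make the two-stage pipeline concrete, below is a minimal PyTorch sketch of a speech-to-image transcoder feeding a face-to-geometry regressor. It is an illustration of the data flow described in the abstract only: the module names mirror the paper's terminology, but every layer choice, the 64x64 image resolution, the 80-dimensional speech features, and the mesh vertex count are assumptions, not the paper's actual architecture.

```python
# Illustrative sketch of the abstract's pipeline: speech -> face image -> 3D mesh.
# All architectural details (layer sizes, image resolution, vertex count) are
# placeholders and do not come from the paper.
import torch
import torch.nn as nn


class SpeechToImageTranscoder(nn.Module):
    """Encodes a speech feature sequence into a shared latent space and
    decodes it into a face image (assumed 3x64x64 here)."""

    def __init__(self, speech_dim=80, latent_dim=256):
        super().__init__()
        # Speech encoder: projects per-frame acoustic features (e.g. mel
        # bins) into the shared cross-modal representation space.
        self.speech_encoder = nn.GRU(speech_dim, latent_dim, batch_first=True)
        # Image decoder: upsamples the latent code into a face image.
        self.image_decoder = nn.Sequential(
            nn.Linear(latent_dim, 128 * 8 * 8),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1),  # 16x16
            nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1),   # 32x32
            nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),    # 64x64
            nn.Sigmoid(),
        )

    def forward(self, speech):  # speech: (B, T, speech_dim)
        _, h = self.speech_encoder(speech)
        latent = h[-1]                       # (B, latent_dim)
        return self.image_decoder(latent)    # (B, 3, 64, 64)


class FaceToGeometryRegressor(nn.Module):
    """Regresses 3D mesh vertices from a face image (the vertex count is a
    placeholder; common topologies such as FLAME use ~5k vertices)."""

    def __init__(self, num_vertices=5023):
        super().__init__()
        self.num_vertices = num_vertices
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1),    # 32x32
            nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1),   # 16x16
            nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1),  # 8x8
            nn.ReLU(),
            nn.Flatten(),
        )
        self.vertex_head = nn.Linear(128 * 8 * 8, num_vertices * 3)

    def forward(self, image):  # image: (B, 3, 64, 64)
        feat = self.image_encoder(image)
        return self.vertex_head(feat).view(-1, self.num_vertices, 3)


# Usage: speech features in, mesh vertices out, with the face image as the
# intermediate cross-modal representation.
transcoder = SpeechToImageTranscoder()
regressor = FaceToGeometryRegressor()
speech = torch.randn(2, 50, 80)         # batch of 2, 50 frames, 80 mel bins
mesh = regressor(transcoder(speech))    # (2, 5023, 3)
```

The key design idea this sketch reflects is that the intermediate image representation decouples the two problems: the transcoder only needs paired speech and video frames, and the regressor only needs images with recoverable 3D geometry, so neither stage requires expensive speech-to-mesh supervision.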

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Yang_2023_ICCV,
    author    = {Yang, Peiji and Wei, Huawei and Zhong, Yicheng and Wang, Zhisheng},
    title     = {Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21032-21041}
}