Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction

Strizhkova, Valeriya; Ferrari, Laura M.; Kachmar, Hadi; Dantcheva, Antitza; Bremond, Francois

Valeriya Strizhkova, Laura M. Ferrari, Hadi Kachmar, Antitza Dantcheva, Francois Bremond; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4693-4702

Abstract

Conversational facial expression recognition entails challenges such as handling facial dynamics, small available datasets, low-intensity and fine-grained emotional expressions, and extreme face angles. To address these challenges, we propose the Masking Action Units and Reconstructing multiple Angles (MAURA) pre-training. MAURA is an efficient self-supervised method that permits the use of small datasets while preserving end-to-end conversational facial expression recognition with a Vision Transformer. MAURA masks videos using the locations of active Action Units and reconstructs synchronized multi-view videos, thus learning the dependencies between muscle movements and encoding information that might be visible only in a few frames and/or in certain views. Based on one view (e.g., frontal), the encoder reconstructs other views (e.g., top, down, laterals). This masking and reconstruction strategy provides a powerful representation that benefits downstream facial expression tasks. Our experimental analysis shows that we consistently outperform the state of the art in the challenging settings of low-intensity and fine-grained conversational facial expression recognition on four datasets, including in-the-wild DFEW, CMU-MOSEI, MFA, and multi-view MEAD. Our results suggest that MAURA is able to learn robust and generic video representations.
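The masking-and-reconstruction idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the per-patch Action Unit activity scores, the `mask_ratio` value, the function names, and the mean-squared-error objective are all assumptions made for illustration, and a real pipeline would operate on video tokens inside a Vision Transformer rather than on toy arrays.

```python
import numpy as np


def au_guided_mask(au_activity, mask_ratio=0.75):
    """Return a boolean mask over patches; True marks a masked patch.

    au_activity: array of shape (num_patches,), a hypothetical per-patch
    score of how strongly active Action Units overlap that patch.
    mask_ratio: assumed fraction of patches to hide (not taken from the paper).
    """
    num_patches = au_activity.shape[0]
    num_masked = int(round(mask_ratio * num_patches))
    # Hide the patches with the highest AU activity first (stable for ties).
    order = np.argsort(-au_activity, kind="stable")
    mask = np.zeros(num_patches, dtype=bool)
    mask[order[:num_masked]] = True
    return mask


def reconstruction_loss(predicted_views, target_views):
    """Mean squared error over all target views, patches, and pixels.

    Both arrays have shape (num_views, num_patches, patch_dim). Using MSE
    here is an assumption for the sketch.
    """
    return float(np.mean((predicted_views - target_views) ** 2))


# Toy example: 8 patches of a frontal-view frame with made-up AU scores.
au_activity = np.array([0.9, 0.1, 0.8, 0.0, 0.3, 0.7, 0.2, 0.05])
mask = au_guided_mask(au_activity, mask_ratio=0.5)
visible_idx = np.flatnonzero(~mask)  # patches an encoder would still see
```

In this sketch an encoder would receive only the `visible_idx` patches of one view, and a decoder's predictions for the other synchronized views would be scored with `reconstruction_loss`.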

Related Material

[bibtex]

@InProceedings{Strizhkova_2024_CVPR,
    author    = {Strizhkova, Valeriya and Ferrari, Laura M. and Kachmar, Hadi and Dantcheva, Antitza and Bremond, Francois},
    title     = {Video Representation Learning for Conversational Facial Expression Recognition Guided by Multiple View Reconstruction},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4693-4702}
}