Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition

Marah Halawa, Florian Blume, Pia Bideau, Martin Maier, Rasha Abdel Rahman, Olaf Hellwich; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4604-4614

Abstract


Human communication is multi-modal; e.g., face-to-face interaction involves auditory signals (speech) and visual signals (face movements and hand gestures). Hence, it is essential to exploit multiple modalities when designing machine learning-based facial expression recognition systems. In addition, given the ever-growing quantities of video data that capture human facial expressions, such systems should utilize raw unlabeled videos without requiring expensive annotations. Therefore, in this work we employ a multi-task multi-modal self-supervised learning method for facial expression recognition from in-the-wild video data. Our model combines three self-supervised objective functions: first, a multi-modal contrastive loss that pulls diverse data modalities of the same video together in the representation space; second, a multi-modal clustering loss that preserves the semantic structure of input data in the representation space; and finally, a multi-modal data reconstruction loss. We conduct a comprehensive study of this multi-modal multi-task self-supervised learning method on three facial expression recognition benchmarks. To that end, we examine the performance of learning through different combinations of self-supervised tasks on the facial expression recognition downstream task. Our model, ConCluGen, outperforms several multi-modal self-supervised and fully supervised baselines on the CMU-MOSEI dataset. Our results generally show that multi-modal self-supervision tasks offer large performance gains for challenging tasks such as facial expression recognition, while also reducing the amount of manual annotations required. We release our pre-trained models as well as source code publicly.
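The first of the three objectives, the multi-modal contrastive loss, is commonly instantiated as a symmetric InfoNCE-style loss between paired embeddings of two modalities. The following is a minimal sketch of that idea, not the authors' released implementation; the function name, embedding shapes, and temperature value are illustrative assumptions:

```python
import numpy as np

def multimodal_contrastive_loss(z_v, z_a, temperature=0.1):
    """Symmetric InfoNCE-style contrastive loss over two modalities.

    z_v, z_a: (N, D) L2-normalized embeddings (e.g. video and audio);
    row i of each matrix comes from the same clip, so (z_v[i], z_a[i])
    is a positive pair and all other rows act as negatives.
    """
    logits = (z_v @ z_a.T) / temperature  # (N, N) similarity logits

    def log_softmax(x, axis):
        # numerically stable log-softmax along the given axis
        x = x - x.max(axis=axis, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

    idx = np.arange(len(z_v))
    # video -> audio direction: normalize over each row
    loss_v2a = -log_softmax(logits, axis=1)[idx, idx].mean()
    # audio -> video direction: normalize over each column
    loss_a2v = -log_softmax(logits, axis=0)[idx, idx].mean()
    return (loss_v2a + loss_a2v) / 2
```

Matched pairs yield a lower loss than unrelated embeddings, which is the "pulling diverse data modalities of the same video together" behavior the abstract describes.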

Related Material


@InProceedings{Halawa_2024_CVPR,
    author    = {Halawa, Marah and Blume, Florian and Bideau, Pia and Maier, Martin and Rahman, Rasha Abdel and Hellwich, Olaf},
    title     = {Multi-Task Multi-Modal Self-Supervised Learning for Facial Expression Recognition},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4604-4614}
}