A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition

R Gnana Praveen, Wheidima Carneiro de Melo, Nasib Ullah, Haseeb Aslam, Osama Zeeshan, Théo Denorme, Marco Pedersoli, Alessandro L. Koerich, Simon Bacon, Patrick Cardinal, Eric Granger; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 2486-2495

Abstract


Multi-modal emotion recognition has recently gained much attention since it can leverage diverse and complementary relationships over multiple modalities, such as audio, visual, and bio-signals. Most state-of-the-art methods for audio-visual (A-V) fusion rely on recurrent networks or conventional attention mechanisms that do not effectively leverage the complementary nature of A-V modalities. In this paper, we focus on dimensional emotion recognition based on the fusion of facial and vocal modalities extracted from videos. Specifically, we propose a joint cross-attention model that relies on complementary relationships to extract the salient features across A-V modalities, allowing for accurate prediction of continuous values of valence and arousal. The proposed fusion model efficiently leverages the inter-modal relationships, while reducing the heterogeneity between features. In particular, it computes cross-attention weights based on the correlation between the joint feature representation and that of each individual modality. By deploying a joint A-V feature representation into the cross-attention module, the performance of our fusion module improves significantly over the vanilla cross-attention module. Experimental results on the AffWild2 dataset highlight the robustness of our proposed A-V fusion model. It achieves a concordance correlation coefficient (CCC) of 0.374 (0.663) and 0.363 (0.584) for valence and arousal, respectively, on the test set (validation set). This is a significant improvement over the baseline of the third Affective Behavior Analysis in-the-wild (ABAW3) challenge, with a CCC of 0.180 (0.310) and 0.170 (0.170).
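The abstract's core idea — computing cross-attention weights from the correlation between a joint A-V representation and each individual modality — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: feature dimensions, the use of `tanh`-scaled correlation matrices, and the final concatenation are plausible assumptions, and the weight matrices below are random stand-ins for parameters the real model would learn.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_cross_attention(Xa, Xv, seed=0):
    """Illustrative joint cross-attention fusion (assumed shapes).

    Xa: audio features, shape (da, L); Xv: visual features, shape (dv, L),
    where L is the number of temporal segments in a clip.
    """
    rng = np.random.default_rng(seed)
    da, L = Xa.shape
    dv, _ = Xv.shape
    d = da + dv

    # joint A-V representation: concatenate features of both modalities
    J = np.concatenate([Xa, Xv], axis=0)             # (d, L)

    # stand-ins for learnable projection matrices
    Wa = rng.standard_normal((da, d)) / np.sqrt(d)
    Wv = rng.standard_normal((dv, d)) / np.sqrt(d)

    # correlation of each modality with the joint representation
    Ca = np.tanh(Xa.T @ Wa @ J / np.sqrt(d))         # (L, L)
    Cv = np.tanh(Xv.T @ Wv @ J / np.sqrt(d))         # (L, L)

    # attention weights re-weight each modality's temporal segments
    Xa_att = Xa @ softmax(Ca, axis=0)                # (da, L)
    Xv_att = Xv @ softmax(Cv, axis=0)                # (dv, L)

    # fused features, e.g. fed to a valence/arousal regressor
    return np.concatenate([Xa_att, Xv_att], axis=0)  # (d, L)
```

Because the attention weights are derived from the joint representation rather than from the opposite modality alone (as in vanilla cross-attention), each modality's attended features are conditioned on both streams at once, which is the intuition behind the reported improvement.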

Related Material


[bibtex]
@InProceedings{Praveen_2022_CVPR,
  author    = {Praveen, R Gnana and de Melo, Wheidima Carneiro and Ullah, Nasib and Aslam, Haseeb and Zeeshan, Osama and Denorme, Th\'eo and Pedersoli, Marco and Koerich, Alessandro L. and Bacon, Simon and Cardinal, Patrick and Granger, Eric},
  title     = {A Joint Cross-Attention Model for Audio-Visual Fusion in Dimensional Emotion Recognition},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2022},
  pages     = {2486-2495}
}