Joint Multimodal Transformer for Emotion Recognition in the Wild

Paul Waligora, Muhammad Haseeb Aslam, Muhammad Osama Zeeshan, Soufiane Belharbi, Alessandro Lameiras Koerich, Marco Pedersoli, Simon Bacon, Eric Granger; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4625-4635

Abstract


Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks, (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors), indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion outperform relevant baseline and state-of-the-art methods.
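The sketch below illustrates the general idea of cross-attention fusion between two per-modality embedding streams (e.g., face and voice), as described in the abstract. The module names, embedding dimension, two-stream layout, and the valence/arousal regression head are illustrative assumptions, not the authors' exact JMT architecture.

```python
# Minimal sketch (assumptions noted above; not the authors' exact JMT) of
# cross-attention fusion between two modality embedding sequences.
# Each modality is assumed to be encoded by its own backbone into a
# sequence of per-frame embeddings; cross-attention lets one modality
# query the other before the fused features are pooled for prediction.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Visual tokens attend to audio tokens, and vice versa.
        self.vis_to_aud = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.aud_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)  # e.g., valence/arousal regression

    def forward(self, vis, aud):
        # vis: (B, T, dim) visual embeddings; aud: (B, T, dim) audio embeddings
        v, _ = self.vis_to_aud(query=vis, key=aud, value=aud)
        a, _ = self.aud_to_vis(query=aud, key=vis, value=vis)
        # Temporal average pooling, then concatenate the two fused streams.
        fused = torch.cat([v.mean(dim=1), a.mean(dim=1)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    model = CrossModalFusion()
    vis = torch.randn(2, 16, 256)   # 2 clips, 16 frames, 256-d embeddings
    aud = torch.randn(2, 16, 256)
    print(model(vis, aud).shape)    # torch.Size([2, 2])
```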

Related Material


[bibtex]
@InProceedings{Waligora_2024_CVPR,
    author    = {Waligora, Paul and Aslam, Muhammad Haseeb and Zeeshan, Muhammad Osama and Belharbi, Soufiane and Koerich, Alessandro Lameiras and Pedersoli, Marco and Bacon, Simon and Granger, Eric},
    title     = {Joint Multimodal Transformer for Emotion Recognition in the Wild},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4625-4635}
}