Leveraging Lightweight Facial Models and Textual Modality in Audio-visual Emotional Understanding in-the-Wild

Andrey Savchenko, Lyudmila Savchenko; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, 2025, pp. 5778-5788

Abstract


This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by lightweight pre-trained models from our EmotiEffLib library with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., a multi-layered perceptron (a feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-score video frames, so that they do not need to be processed by a domain-specific video classifier. The video-level prediction of emotional mimicry intensity is implemented by simply aggregating frame-level features and training a multi-layered perceptron. Experimental results on four tasks from the ABAW challenge demonstrate that our approach significantly improves validation metrics over the existing baselines. As a result, our solutions took first place in the expression classification and ambivalence/hesitancy recognition challenges, and third place in the emotional mimicry intensity estimation and action unit detection tasks.
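The abstract's core recipe (pool frame-level embeddings into one video-level vector, then classify with a one-hidden-layer feed-forward network) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 512-d embedding size, mean/std statistical pooling, ReLU activation, and the 8-class output are all assumptions for the example.

```python
import numpy as np

def aggregate_frames(frame_feats):
    """Pool per-frame embeddings (num_frames, dim) into one video-level vector.

    Mean + std statistical pooling is a common aggregation choice; the exact
    pooling used in the paper is an assumption here.
    """
    return np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

class OneHiddenLayerMLP:
    """Feed-forward network with a single hidden layer, matching the
    'multi-layered perceptron' classifier described in the abstract."""

    def __init__(self, in_dim, hidden_dim, num_classes, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.02, (in_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0.0, 0.02, (hidden_dim, num_classes))
        self.b2 = np.zeros(num_classes)

    def forward(self, x):
        h = np.maximum(0.0, x @ self.W1 + self.b1)  # ReLU hidden layer
        logits = h @ self.W2 + self.b2
        e = np.exp(logits - logits.max())           # numerically stable softmax
        return e / e.sum()

# Toy example: 100 frames of hypothetical 512-d facial embeddings
# (e.g., as produced by a lightweight EmotiEffLib model).
frames = np.random.default_rng(1).normal(size=(100, 512))
video_feat = aggregate_frames(frames)   # (1024,) video-level descriptor
mlp = OneHiddenLayerMLP(in_dim=1024, hidden_dim=128, num_classes=8)
probs = mlp.forward(video_feat)         # class probabilities over 8 expressions
```

In a trained system, W1/b1/W2/b2 would be fitted on the challenge training set; the sketch only demonstrates the data flow from frame features to a video-level prediction.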

Related Material


[bibtex]
@InProceedings{Savchenko_2025_CVPR,
  author    = {Savchenko, Andrey and Savchenko, Lyudmila},
  title     = {Leveraging Lightweight Facial Models and Textual Modality in Audio-visual Emotional Understanding in-the-Wild},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {5778-5788}
}