@InProceedings{Savchenko_2025_CVPR,
  author    = {Savchenko, Andrey and Savchenko, Lyudmila},
  title     = {Leveraging Lightweight Facial Models and Textual Modality in Audio-visual Emotional Understanding in-the-Wild},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {5778-5788}
}
Leveraging Lightweight Facial Models and Textual Modality in Audio-visual Emotional Understanding in-the-Wild
Abstract
This article presents our results for the eighth Affective Behavior Analysis in-the-Wild (ABAW) competition. We combine facial emotional descriptors extracted by lightweight pre-trained models from our EmotiEffLib library with acoustic features and embeddings of texts recognized from speech. The frame-level features are aggregated and fed into simple classifiers, e.g., a multi-layer perceptron (a feed-forward neural network with one hidden layer), to predict ambivalence/hesitancy and facial expressions. In the latter case, we also use the pre-trained facial expression recognition model to select high-confidence video frames and spare them from processing by a domain-specific video classifier. Video-level prediction of emotional mimicry intensity is implemented by simply aggregating the frame-level features and training a multi-layer perceptron. Experimental results on four tasks of the ABAW challenge demonstrate that our approach substantially improves validation metrics over existing baselines. As a result, our solutions took first place in the facial expression classification and ambivalence/hesitancy recognition challenges, and third place in the emotional mimicry intensity estimation and action unit detection tasks.
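The aggregate-then-classify pipeline sketched in the abstract (per-frame multimodal features pooled into a video-level descriptor, then a one-hidden-layer feed-forward network) can be illustrated as follows. All dimensions, feature names, and the random initialization are illustrative assumptions for the sketch, not the authors' actual configuration or the EmotiEffLib API.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-frame features (names and sizes are illustrative):
# facial embeddings from an EmotiEffLib-style model, acoustic features,
# and one text embedding for the speech recognized over the whole clip.
n_frames = 50
facial = rng.standard_normal((n_frames, 256))
acoustic = rng.standard_normal((n_frames, 64))
text = rng.standard_normal(128)

# Frame-level fusion: concatenate modalities,
# broadcasting the clip-level text embedding to every frame.
frame_feats = np.concatenate(
    [facial, acoustic, np.tile(text, (n_frames, 1))], axis=1
)

# Video-level descriptor: simple statistical pooling (mean + std) over frames.
video_feat = np.concatenate([frame_feats.mean(axis=0), frame_feats.std(axis=0)])

# One-hidden-layer MLP, randomly initialized here purely for illustration;
# in practice it would be trained on the challenge training set.
n_classes = 8  # e.g., number of facial expression categories (assumed)
W1 = rng.standard_normal((video_feat.size, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, n_classes)) * 0.01
b2 = np.zeros(n_classes)

hidden = np.maximum(0.0, video_feat @ W1 + b1)  # ReLU hidden layer
logits = hidden @ W2 + b2
probs = np.exp(logits - logits.max())
probs /= probs.sum()  # softmax over expression classes
```

The same pooled descriptor can feed a regression head instead of a softmax for the emotional mimicry intensity task.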