Multi-Task Learning for Human Affect Prediction With Auditory-Visual Synchronized Representation

Euiseok Jeong, Geesung Oh, Sejoon Lim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 2438-2445

Abstract

With the development of big data and deep learning technologies, research on predicting human affect in the wild with deep neural networks is being actively conducted. Many researchers use image and audio together to improve affect prediction performance; however, the image and audio representations are usually not synchronized. Moreover, human affect can be annotated in many different ways, and the annotation schemes of many datasets do not match. Without annotations for the target task, data cannot be used in supervised learning. This study proposes a multi-task human affect prediction model with multimodal input and knowledge distillation to address these problems. We used SoundNet, which was trained to transfer visual knowledge into auditory representations, to extract synchronized auditory-visual representations, and we applied knowledge distillation to utilize datasets with incomplete labels. The model predicts valence-arousal, expression, and action units from image and audio data and was validated on the Aff-Wild2 dataset. Using the auditory-visual synchronized representation improved performance by 11.83% and 230.16% over using the visual or auditory representation alone, respectively. Applying knowledge distillation improved performance by 15.38% over not applying it. Consequently, the proposed model achieved a performance of 0.95 on the multi-task learning task of the Aff-Wild2 test dataset, equivalent to the second-place result in the Multi-Task Learning Challenge of the 3rd Affective Behavior Analysis in-the-wild (ABAW) Competition.
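Below is a minimal PyTorch sketch of the setup the abstract describes: visual features fused with SoundNet-style auditory features, three task heads (valence-arousal regression, expression classification, action-unit detection), and a loss that masks tasks whose labels are missing and distills a teacher's soft expression predictions onto unlabeled samples. All module names, dimensions, loss weights, and the teacher interface are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskAffectModel(nn.Module):
    """Fuses visual and auditory features and predicts three affect tasks."""

    def __init__(self, visual_dim=512, audio_dim=512, hidden_dim=256,
                 num_expressions=8, num_aus=12):
        super().__init__()
        self.fusion = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        self.va_head = nn.Linear(hidden_dim, 2)                  # valence, arousal
        self.expr_head = nn.Linear(hidden_dim, num_expressions)  # class logits
        self.au_head = nn.Linear(hidden_dim, num_aus)            # multi-label logits

    def forward(self, visual_feat, audio_feat):
        # audio_feat is assumed to come from a SoundNet-style encoder whose
        # representations were trained to mirror visual semantics, which is
        # what makes the two modalities "synchronized" here.
        h = self.fusion(torch.cat([visual_feat, audio_feat], dim=-1))
        return torch.tanh(self.va_head(h)), self.expr_head(h), self.au_head(h)


def masked_multitask_loss(va_pred, expr_logits, au_logits,
                          va_true, expr_true, au_true,
                          va_mask, expr_mask, au_mask,
                          expr_soft=None, distill_weight=0.5):
    """Sums per-task losses, zeroing out tasks whose labels are missing.

    Masked entries of va_true/expr_true/au_true must still hold valid
    placeholder values (e.g., zeros) so the unreduced losses stay finite.
    When a teacher's soft expression distribution expr_soft is given, a KL
    distillation term is applied to samples lacking hard expression labels.
    """
    va_loss = (F.mse_loss(va_pred, va_true, reduction="none").mean(-1)
               * va_mask).sum() / va_mask.sum().clamp(min=1)
    expr_loss = (F.cross_entropy(expr_logits, expr_true, reduction="none")
                 * expr_mask).sum() / expr_mask.sum().clamp(min=1)
    au_loss = (F.binary_cross_entropy_with_logits(
                   au_logits, au_true, reduction="none").mean(-1)
               * au_mask).sum() / au_mask.sum().clamp(min=1)
    total = va_loss + expr_loss + au_loss
    if expr_soft is not None:
        unlabeled = 1.0 - expr_mask
        kl = F.kl_div(F.log_softmax(expr_logits, dim=-1), expr_soft,
                      reduction="none").sum(-1)
        total = total + distill_weight * (kl * unlabeled).sum() / unlabeled.sum().clamp(min=1)
    return total


# Toy usage with random features standing in for real encoder outputs.
model = MultiTaskAffectModel()
va, expr, au = model(torch.randn(4, 512), torch.randn(4, 512))

Masking rather than discarding incompletely labeled samples keeps every sample in the batch, which is what lets datasets with mismatched annotations be used at all; distillation then supplies a learning signal for the tasks whose hard labels are absent.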

Related Material

@InProceedings{Jeong_2022_CVPR,
  author    = {Jeong, Euiseok and Oh, Geesung and Lim, Sejoon},
  title     = {Multi-Task Learning for Human Affect Prediction With Auditory-Visual Synchronized Representation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2022},
  pages     = {2438-2445}
}