Privileged Knowledge Distillation for Dimensional Emotion Recognition in the Wild
Automated emotion recognition (AER) has a growing number of applications, ranging from behavior analysis in assistive robotics and smart e-learning to depression or pain detection and e-health. Systems for multimodal AER typically outperform unimodal approaches due to the complementary and redundant semantic information across modalities such as visual, audio, language, and physiological signals. In practice, however, only a subset of these modalities is available at inference time, and using multiple modalities increases system complexity. This paper focuses on video-based AER and aims to enhance the accuracy of unimodal systems by leveraging the Learning Under Privileged Information (LUPI) paradigm with information from multiple modalities. Without loss of generality, the audio modality is considered as privileged information (available only during training) in this study, and a new multimodal-to-unimodal privileged knowledge distillation (M2PKD) mechanism is introduced. The teacher network is a multimodal model that processes audio-visual information and distills the learned knowledge to a unimodal visual student network. We validate the proposed M2PKD approach on the challenging RECOLA and Affwild2 datasets for video-based AER, using weak and strong baseline AER architectures, as well as joint cross-attention fusion methods. The proposed M2PKD method increases the absolute average concordance correlation coefficient by 8% on RECOLA, and a 2% increase in the arousal dimension is observed on Affwild2.
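A minimal sketch of the training objective described above, under stated assumptions: the concordance correlation coefficient (CCC) is the standard dimensional-AER metric, and privileged distillation is illustrated here with a hypothetical MSE term that pulls the visual student's embedding toward the frozen audio-visual teacher's embedding. The function names, the MSE form of the distillation term, and the weighting `lam` are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ccc(x, y):
    # Concordance correlation coefficient between predictions x and labels y:
    # 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2.0 * cov / (vx + vy + (mx - my) ** 2)

def m2pkd_loss(student_pred, labels, student_feat, teacher_feat, lam=0.5):
    # Task loss: 1 - CCC on the continuous valence/arousal predictions.
    task = 1.0 - ccc(student_pred, labels)
    # Privileged distillation term (assumed MSE form): match the visual
    # student's embedding to the frozen multimodal teacher's embedding,
    # which had access to audio during training.
    distill = np.mean((student_feat - teacher_feat) ** 2)
    return task + lam * distill
```

At inference time only the student and its visual input are needed; the teacher and the audio stream are discarded, which is what keeps the deployed system unimodal.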