Facial Expression Recognition Based on Multi-Modal Features for Videos in the Wild
This paper presents our submission to the Expression Classification Challenge of the 5th Affective Behavior Analysis in-the-wild (ABAW) Competition. In our method, multimodal features are extracted by several different pre-trained models and combined in different configurations to capture more effective emotion information. Specifically, we extract facial expression features using a Masked Autoencoder (MAE) encoder pre-trained on a large-scale face dataset. For these combinations of visual and audio features, we utilize two kinds of temporal encoders to explore the temporal contextual information in the data. In addition, we employ several ensemble strategies across different experimental settings to obtain the most accurate expression recognition results. Our system achieves an average F1 score of 0.4072 on the Aff-Wild2 test set, ranking 2nd, which demonstrates the effectiveness of our method.
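As a rough illustration of the fusion-plus-temporal-encoding pipeline the abstract describes, the sketch below combines per-frame visual and audio features and passes them through a Transformer temporal encoder before frame-level classification. This is a minimal PyTorch sketch under assumed dimensions and hyperparameters (e.g. a 768-d MAE visual feature, a 128-d audio feature, 8 expression classes); it is not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class TemporalExpressionClassifier(nn.Module):
    """Fuse per-frame visual and audio features, model temporal context
    with a Transformer encoder, and predict one expression per frame.

    All dimensions below are illustrative assumptions, not the
    configuration used in the paper."""

    def __init__(self, vis_dim=768, aud_dim=128, d_model=256,
                 n_heads=4, n_layers=2, n_classes=8):
        super().__init__()
        # Project the concatenated modal features into a shared space.
        self.proj = nn.Linear(vis_dim + aud_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, vis_feats, aud_feats):
        # vis_feats: (B, T, vis_dim); aud_feats: (B, T, aud_dim)
        x = torch.cat([vis_feats, aud_feats], dim=-1)
        x = self.temporal(self.proj(x))
        return self.head(x)  # (B, T, n_classes) frame-level logits

model = TemporalExpressionClassifier()
logits = model(torch.randn(2, 16, 768), torch.randn(2, 16, 128))
print(tuple(logits.shape))  # (batch, frames, classes)
```

An ensemble in the spirit of the abstract could then average the per-frame logits of several such models built from different feature combinations.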