Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5

Su Zhang, Ziyuan Zhao, Cuntai Guan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 5764-5769


We used two multimodal models for continuous valence-arousal recognition from visual, audio, and linguistic information. The first model is the one we used in ABAW2 and ABAW3 and employs leader-follower attention. The second model shares the first model's architecture for spatial and temporal encoding but, for the fusion block, employs a compact and straightforward channel attention borrowed from the End2You toolkit. Unlike our previous attempts, which used VGGish features directly as the audio features, this time we feed log-mel spectrograms to the pre-trained VGG model and fine-tune it during training. To make full use of the data and alleviate over-fitting, we carry out cross-validation. The code is available at
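
The cross-validation step mentioned in the abstract could be organized along the following lines. This is only a minimal sketch under stated assumptions, not the authors' implementation: the function name `k_fold_partitions`, the trial identifiers, the round-robin assignment, and the number of folds are all hypothetical, since the abstract does not specify how the folds are built. The sketch assigns whole trials (videos) to folds so that frames from one trial never land in both the training and validation sets of the same fold.

```python
from typing import Dict, List, Sequence


def k_fold_partitions(
    trial_ids: Sequence[str], num_folds: int
) -> List[Dict[str, List[str]]]:
    """Split trial IDs into num_folds train/validation partitions.

    Hypothetical helper: the fold count and assignment rule are assumptions,
    not details taken from the report.
    """
    if not 2 <= num_folds <= len(trial_ids):
        raise ValueError("num_folds must be between 2 and the number of trials")

    # Sort for a deterministic split that does not depend on input order.
    ordered = sorted(trial_ids)

    # Round-robin assignment keeps fold sizes within one trial of each other.
    folds = [ordered[i::num_folds] for i in range(num_folds)]

    partitions = []
    for held_out in range(num_folds):
        validation = folds[held_out]
        # Every fold except the held-out one contributes to training.
        training = [
            trial
            for index, fold in enumerate(folds)
            if index != held_out
            for trial in fold
        ]
        partitions.append({"train": training, "validation": validation})
    return partitions
```

Each trial is held out for validation exactly once, so one model can be trained per partition and every labeled trial contributes to both training and evaluation across the folds.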

Related Material

@InProceedings{Zhang_2023_CVPR,
    author    = {Zhang, Su and Zhao, Ziyuan and Guan, Cuntai},
    title     = {Multimodal Continuous Emotion Recognition: A Technical Report for ABAW5},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2023},
    pages     = {5764-5769}
}