AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts

Yu, Jun; Zhang, Zerui; Wei, Zhihong; Zhao, Gongpeng; Cai, Zhongpeng; Wang, Yongqi; Xie, Guochen; Zhu, Jichao; Zhu, Wangyuan; Liu, Qingsong; Liang, Jiaen

Jun Yu, Zerui Zhang, Zhihong Wei, Gongpeng Zhao, Zhongpeng Cai, Yongqi Wang, Guochen Xie, Jichao Zhu, Wangyuan Zhu, Qingsong Liu, Jiaen Liang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4814-4821

Abstract

Leveraging the synergy of audio and visual data is essential for understanding human emotions and behaviors, especially in in-the-wild settings. Traditional methods for integrating such multimodal information often stumble, leading to less-than-ideal outcomes in the task of facial action unit (AU) detection. Addressing these challenges, our study introduces a novel approach that synergistically enhances audio-visual data processing. For audio, we employ Mel Frequency Cepstral Coefficients (MFCC) and Log-Mel spectrogram features enriched through a pre-trained VGGish network, significantly bolstering the audio feature landscape. Concurrently, in the visual spectrum, we enhance feature extraction using an iResNet model pre-trained on facial datasets, thereby improving the robustness and quality of the visual data representation. With this augmented feature set, Temporal Convolutional Networks (TCN) are applied to extract and analyze time-series characteristics within each modality, fostering a nuanced understanding of temporal dynamics. The integration of cross-modal information is then achieved through a fine-tuned pre-trained GPT-2 model, facilitating sophisticated and context-aware fusion of the multimodal data. This comprehensive approach not only enhances the accuracy of AU detection but also paves the way for a nuanced comprehension of complex emotional and behavioral expressions in real-world scenarios.
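The abstract names Temporal Convolutional Networks as the per-modality temporal model but gives no layer details. The following is a minimal pure-Python sketch of the dilated causal convolution that TCNs are commonly built from, included only to illustrate how stacking dilations widens the temporal context. Everything here is an assumption for illustration and not the authors' implementation: the function names (`dilated_causal_conv1d`, `receptive_field`, `tcn_stack`), the causal padding, the shared kernel, the ReLU-plus-residual block, and the dilation schedule are all hypothetical.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Compute y[t] = sum_k kernel[k] * x[t - k * dilation].

    Positions before the start of the sequence are treated as zeros,
    so y[t] never depends on inputs later than t (causal behaviour).
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Number of input steps that can influence one output step of the stack."""
    return 1 + (kernel_size - 1) * sum(dilations)


def tcn_stack(
    x: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]
) -> List[float]:
    """Apply one conv per dilation, each followed by ReLU and a residual add.

    Assumption: a single shared kernel and a single channel, purely to
    keep the sketch short; real TCN layers learn separate multi-channel weights.
    """
    h = [float(v) for v in x]
    for d in dilations:
        conv = dilated_causal_conv1d(h, kernel, d)
        h = [max(0.0, c) + r for c, r in zip(conv, h)]
    return h


if __name__ == "__main__":
    frames = [1.0, 2.0, 3.0, 4.0]  # stand-in for one feature channel over time
    print(dilated_causal_conv1d(frames, [1.0, 1.0], dilation=2))
    print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))
```

With kernel size 2 and dilations 1, 2, 4, each output step sees 8 input steps, which is why dilation schedules of this kind are a common way to cover long frame sequences with few layers. The output length equals the input length, so per-frame predictions remain aligned with the input frames.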

Related Material

@InProceedings{Yu_2024_CVPR,
    author    = {Yu, Jun and Zhang, Zerui and Wei, Zhihong and Zhao, Gongpeng and Cai, Zhongpeng and Wang, Yongqi and Xie, Guochen and Zhu, Jichao and Zhu, Wangyuan and Liu, Qingsong and Liang, Jiaen},
    title     = {AUD-TGN: Advancing Action Unit Detection with Temporal Convolution and GPT-2 in Wild Audiovisual Contexts},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4814-4821}
}