Ensemble Spatial and Temporal Vision Transformer for Action Units Detection

Ngoc Tu Vu, Van Thong Huynh, Trong Nghia Nguyen, Soo-Hyung Kim; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 5770-5776

Abstract


Facial Action Unit (FAU) detection represents a fine-grained classification problem that involves identifying different units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach for addressing the task of Action Unit (AU) detection in the context of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture temporal facial changes in the video. Besides, to reduce the massive size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge, with a notable 14% difference in results. Our team achieved a position within the top five teams in the ABAW 2023 competition, trailing the third- and fourth-place teams by narrow margins of 0.27% and 0.43%, respectively.
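The two-stage design described above (a CNN backbone extracting per-frame spatial features, followed by a Transformer encoder over the temporal dimension, with a sigmoid multi-label head for AU detection) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the small convolutional stack stands in for the RegNet backbone used in the paper (e.g. torchvision's `regnet_y_400mf`), and the AU count, feature width, and layer counts are assumed values.

```python
import torch
import torch.nn as nn

class SpatialTemporalAUDetector(nn.Module):
    """Hypothetical sketch: per-frame CNN features + temporal Transformer encoder."""

    def __init__(self, num_aus=12, feat_dim=128, num_frames=8):
        super().__init__()
        # Stand-in CNN backbone; the paper replaces ViViT's embedding layers
        # with a RegNet here (e.g. torchvision.models.regnet_y_400mf).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # Learned temporal position embedding, one vector per frame.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, feat_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=4, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(feat_dim, num_aus)

    def forward(self, clip):
        # clip: (batch, frames, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1)).flatten(1)  # (B*T, feat_dim)
        feats = feats.view(b, t, -1) + self.pos_embed         # temporal tokens
        feats = self.temporal(feats)                          # (B, T, feat_dim)
        # Mean-pool over time; sigmoid gives independent per-AU probabilities.
        return torch.sigmoid(self.head(feats.mean(dim=1)))

model = SpatialTemporalAUDetector()
out = model(torch.randn(2, 8, 3, 64, 64))  # (2, 12) per-AU probabilities
```

Because AU detection is multi-label (several units can be active in the same frame), the head uses an independent sigmoid per unit rather than a softmax over classes.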

Related Material


[pdf]
[bibtex]
@InProceedings{Vu_2023_CVPR,
  author    = {Vu, Ngoc Tu and Huynh, Van Thong and Nguyen, Trong Nghia and Kim, Soo-Hyung},
  title     = {Ensemble Spatial and Temporal Vision Transformer for Action Units Detection},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2023},
  pages     = {5770-5776}
}