Multi-Attention Transformer for Naturalistic Driving Action Recognition

Xiaodong Dong, Ruijie Zhao, Hao Sun, Dong Wu, Jin Wang, Xuyang Zhou, Jiang Liu, Shun Cui, Zhongjiang He; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 5435-5441

Abstract


To detect the start and end time of each action in the untrimmed videos of Track 3 of the AI City Challenge, this paper proposes a powerful network architecture, the Multi-Attention Transformer. Previous methods extract features with a fixed sliding window, i.e., a fixed time interval, and then predict the start and end times of each action. We believe that a series of fixed windows corrupts video features that carry contextual information. We therefore present a Multi-Attention Transformer module that combines local window attention and global attention to address this problem. Equipped with features provided by VideoMAE, the method achieves a score of 66.34; a time correction module then improves the score to 67.23 on validation set A2. With this approach, we achieved third place on the Track 3 A2 dataset of the 2023 AI City Challenge.
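
The abstract describes combining local window attention over fixed-size temporal windows with global attention over the whole untrimmed sequence. The sketch below is one illustrative way such a block could be assembled on top of per-clip features (e.g., extracted by VideoMAE); it is not the authors' released implementation, and all names and parameters (MultiAttentionBlock, window_size, etc.) are assumptions made for illustration.

    # Minimal sketch, assuming per-clip features of shape (batch, num_clips, dim).
    import torch
    import torch.nn as nn


    class MultiAttentionBlock(nn.Module):
        """Mixes window-local and global self-attention over a clip-feature sequence."""

        def __init__(self, dim: int, num_heads: int = 8, window_size: int = 16):
            super().__init__()
            self.window_size = window_size
            self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm1 = nn.LayerNorm(dim)
            self.norm2 = nn.LayerNorm(dim)
            self.mlp = nn.Sequential(
                nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, t, d = x.shape
            w = self.window_size
            pad = (w - t % w) % w
            h = self.norm1(x)

            # Local branch: attend only within non-overlapping temporal windows.
            h_local = nn.functional.pad(h, (0, 0, 0, pad))  # pad the time axis to a multiple of w
            h_local = h_local.reshape(b * (t + pad) // w, w, d)
            h_local, _ = self.local_attn(h_local, h_local, h_local)
            h_local = h_local.reshape(b, t + pad, d)[:, :t]

            # Global branch: attend across the entire untrimmed sequence.
            h_global, _ = self.global_attn(h, h, h)

            # Fuse both attention branches, then apply a feed-forward layer.
            x = x + h_local + h_global
            x = x + self.mlp(self.norm2(x))
            return x


    if __name__ == "__main__":
        feats = torch.randn(2, 100, 256)   # 2 videos, 100 clip features each
        block = MultiAttentionBlock(dim=256)
        print(block(feats).shape)          # torch.Size([2, 100, 256])

Because the local branch is restricted to short windows while the global branch spans the full video, the block can keep fine-grained boundary cues without discarding long-range context, which is the motivation the abstract gives for moving away from a single fixed sliding window.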

Related Material


[pdf]
[bibtex]
@InProceedings{Dong_2023_CVPR,
  author    = {Dong, Xiaodong and Zhao, Ruijie and Sun, Hao and Wu, Dong and Wang, Jin and Zhou, Xuyang and Liu, Jiang and Cui, Shun and He, Zhongjiang},
  title     = {Multi-Attention Transformer for Naturalistic Driving Action Recognition},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2023},
  pages     = {5435-5441}
}