Multi-Attention Transformer for Naturalistic Driving Action Recognition
To detect the start and end time of each action in an untrimmed video in Track 3 of the AI City Challenge, this paper proposes a powerful network architecture, the Multi-Attention Transformer. Previous methods extract features with a fixed sliding window, i.e., a fixed time interval, and then predict the start and end times of each action. We believe that adopting a series of fixed windows corrupts the contextual information contained in the video features. We therefore present a Multi-Attention Transformer module that combines local window attention and global attention to address this problem. Using features provided by VideoMAE, the method achieves a score of 66.34; a time correction module then improves the score to 67.23 on validation set A2. Finally, we achieved third place on the Track 3 A2 dataset of the AI City Challenge 2023.
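To make the idea of combining local window attention with global attention concrete, here is a minimal sketch in NumPy. The abstract does not specify how the two attention outputs are fused, so the weighted sum (`alpha`), the window radius `w`, and all function names below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # scaled dot-product attention over a sequence of feature vectors
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block masked-out positions
    return softmax(scores) @ v

def local_window_mask(n, w):
    # each frame may attend only to frames within +/- w time steps
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def multi_attention(x, w=2, alpha=0.5):
    # local window attention captures fine temporal detail;
    # global attention captures long-range context.
    # alpha is a hypothetical mixing weight, not from the paper.
    n = x.shape[0]
    local_out = attention(x, x, x, local_window_mask(n, w))
    global_out = attention(x, x, x)
    return alpha * local_out + (1 - alpha) * global_out

# toy input: 16 frames of 8-dimensional features
feats = np.random.default_rng(0).normal(size=(16, 8))
out = multi_attention(feats)
print(out.shape)  # (16, 8)
```

In a real model each attention branch would have its own learned query/key/value projections and multiple heads; the sketch only shows how a banded mask restricts attention to a local temporal window while the unmasked branch sees the full sequence.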