Learning Generalized Feature for Temporal Action Detection: Application for Natural Driving Action Recognition Challenge
This paper reports our approach for the 2022 AI City Challenge - Naturalistic Driving Action Recognition (Track 3), where the objective is to detect when and what kind of action a driver performs in a long, untrimmed video. Our solution is built upon the single-stage ActionFormer detector, in which temporal location and classification are predicted simultaneously for efficiency. The input features for the detector are extracted offline using our proposed backbone, which we name "ConvNeXt-Video". However, due to the small size of the dataset, training the model without over-fitting is challenging. To address this problem, we focus on training techniques that improve the generalization of the underlying features. Specifically, we utilize two methods: "learning without forgetting" and semi-weakly supervised learning on the unlabeled A2 data. Finally, we also add a second-stage classifier (SSC) built on our ConvNeXt-Video backbone, which combines information from multiple clips and multiple camera views to improve prediction precision. Our best result achieves a 29.1 F1 score on the public test set. Our source code is released at \href{https://github.com/cybercore-co-ltd/AICity2022-Track3}{link}.
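The second-stage classifier aggregates scores across clips and camera views. The abstract does not specify the fusion rule, so the sketch below assumes simple mean pooling of per-clip, per-view class scores; the array layout and function name are hypothetical, not the authors' implementation.

```python
import numpy as np

def fuse_multiview_scores(scores):
    """Fuse per-clip, per-view class scores into one prediction.

    scores: array of shape (num_views, num_clips, num_classes).
    Mean pooling over views and clips is an assumption here; the
    paper only states that multi-clip and multi-view information
    are combined.
    """
    scores = np.asarray(scores, dtype=float)
    fused = scores.mean(axis=(0, 1))        # pool over views and clips
    return int(fused.argmax()), fused       # predicted class index, fused scores

# Example: 3 camera views, 2 clips each, 4 hypothetical action classes
rng = np.random.default_rng(0)
scores = rng.random((3, 2, 4))
pred, fused = fuse_multiview_scores(scores)
```

Mean pooling is a common, robust baseline for multi-view fusion; learned weighting per view would be a natural extension.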