Multi-View Spatial-Temporal Learning for Understanding Unusual Behaviors in Untrimmed Naturalistic Driving Videos

Huy-Hung Nguyen, Chi Dai Tran, Long Hoang Pham, Duong Nguyen-Ngoc Tran, Tai Huu-Phuong Tran, Duong Khac Vu, Quoc Pham-Nam Ho, Ngoc Doan-Minh Huynh, Hyung-Min Jeon, Hyung-Joon Jeon, Jae Wook Jeon; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7144-7152


The task of Naturalistic Driving Action Recognition aims to detect and temporally localize distracting driving behavior in untrimmed videos. Prior studies have demonstrated that video action recognition using self-supervised video pre-training can outperform contrastive learning-based pre-training methods without using extra data, even on relatively small-scale video datasets. In this paper, we introduce our framework for Track 3 of the 8th AI City Challenge in 2024. The approach is primarily based on large model fine-tuning and ensemble techniques to train a set of action recognition models on a small-scale dataset. Starting with raw videos, we segment them into individual action sequences based on their annotations. We then fine-tune four different action recognition models, with K-fold cross-validation applied to the segmented data. Following this, we execute a multi-view ensemble, selecting the most visible camera views for each action class, to generate clip-level classification results for each video. Finally, a multi-step post-processing algorithm, designed around the specific features of the AI City Challenge dataset, is employed to perform temporal action localization and produce temporal segments for the actions. Our solution achieves a final score of 0.7798 and attains the 5th rank on the public leaderboard for the test set A2 of the challenge. The source code will be publicly available at
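The last two stages of the pipeline (multi-view ensemble, then temporal localization) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the view names, the class-to-view mapping, the plain averaging rule, the clip length, the minimum segment length, and the function names `ensemble_views` and `clips_to_segments` are all assumptions made for illustration, and the single merge-and-filter pass stands in for the paper's multi-step post-processing, which the abstract does not detail.

```python
from typing import Dict, List, Tuple

# view name -> per-clip list of class probabilities (hypothetical layout)
Probs = Dict[str, List[List[float]]]


def ensemble_views(probs_by_view: Probs,
                   views_for_class: List[List[str]]) -> List[List[float]]:
    """Fuse clip-level scores, averaging each class over its selected views.

    views_for_class[c] lists the camera views assumed to show class c best.
    """
    num_clips = len(next(iter(probs_by_view.values())))
    fused = []
    for t in range(num_clips):
        row = []
        for c, views in enumerate(views_for_class):
            row.append(sum(probs_by_view[v][t][c] for v in views) / len(views))
        fused.append(row)
    return fused


def clips_to_segments(fused: List[List[float]],
                      clip_len_s: float = 1.0,
                      background: int = 0,
                      min_len_s: float = 2.0) -> List[Tuple[int, float, float]]:
    """Merge runs of identical clip labels into (class, start_s, end_s).

    Background runs and runs shorter than min_len_s are dropped.
    """
    labels = [max(range(len(row)), key=row.__getitem__) for row in fused]
    segments = []
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of the video or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            length = (t - start) * clip_len_s
            if labels[start] != background and length >= min_len_s:
                segments.append((labels[start], start * clip_len_s, t * clip_len_s))
            start = t
    return segments


if __name__ == "__main__":
    # Toy scores for 6 one-second clips, 3 classes (0 = background), 2 views.
    probs = {
        "dashboard": [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.8, 0.1],
                      [0.2, 0.7, 0.1], [0.7, 0.2, 0.1], [0.3, 0.1, 0.6]],
        "rear": [[0.8, 0.1, 0.1], [0.2, 0.2, 0.6], [0.3, 0.5, 0.2],
                 [0.2, 0.6, 0.2], [0.9, 0.05, 0.05], [0.2, 0.1, 0.7]],
    }
    # Assumed mapping: class 1 is judged on the dashboard view, class 2 on the rear view.
    views = [["dashboard", "rear"], ["dashboard"], ["rear"]]
    print(clips_to_segments(ensemble_views(probs, views)))
    # prints [(1, 1.0, 4.0)]
```

In this toy run, the one-clip detection of class 2 at the end is discarded by the minimum-length filter, which is the kind of noise suppression a post-processing step is typically meant to provide.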

Related Material

@InProceedings{Nguyen_2024_CVPR,
    author    = {Nguyen, Huy-Hung and Tran, Chi Dai and Pham, Long Hoang and Tran, Duong Nguyen-Ngoc and Tran, Tai Huu-Phuong and Vu, Duong Khac and Ho, Quoc Pham-Nam and Huynh, Ngoc Doan-Minh and Jeon, Hyung-Min and Jeon, Hyung-Joon and Jeon, Jae Wook},
    title     = {Multi-View Spatial-Temporal Learning for Understanding Unusual Behaviors in Untrimmed Naturalistic Driving Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7144-7152}
}