DuST: Dual Swin Transformer for Multi-modal Video and Time-Series Modeling

Liang Shi, Yixin Chen, Meimei Liu, Feng Guo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 4537-4546

Abstract


This paper proposes DuST, a novel Dual Swin Transformer model that integrates video with synchronous time-series data in the context of driving risk assessment. The DuST model utilizes the Swin Transformer architecture for feature extraction from both modalities: specifically, a Video Swin Transformer is adopted for video and a 1D Swin Transformer for time-series data. The hierarchical structure and window-based multi-head self-attention in Swin Transformers effectively capture both local and global features. A comparison of multiple fusion methods confirmed that the tailored stagewise fusion process leads to enhanced model performance by effectively capturing complementary information from multimodal data. The approach was applied to the Second Strategic Highway Research Program Naturalistic Driving Study data for classifying crashes, tire strikes, near-crashes, and normal driving segments using front-view videos and triaxial acceleration data. The innovative multi-modal method demonstrates superior classification performance, highlighting its potential for video-time-series modeling in critical applications such as advanced driver assistance systems and automated driving systems. The code for the proposed framework is available at https://github.com/datadrivenwheels/DUST.
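To make the stagewise fusion idea concrete, below is a minimal PyTorch sketch of a dual-branch classifier that fuses pooled features from a video branch and a time-series branch after every stage. The stage blocks here are simple convolutional stand-ins, not the paper's Swin Transformer blocks, and the names (VideoStage, SeriesStage, DualStagewiseFusion) are hypothetical illustrations; the authors' actual implementation is in the linked repository.

```python
# Hedged sketch of stagewise multi-modal fusion (not the authors' code).
import torch
import torch.nn as nn

class VideoStage(nn.Module):
    """Stand-in for one Video Swin stage: downsamples space, widens channels."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.GELU(),
        )

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.block(x)

class SeriesStage(nn.Module):
    """Stand-in for one 1D Swin stage over triaxial acceleration signals."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
        )

    def forward(self, x):  # x: (B, C, L)
        return self.block(x)

class DualStagewiseFusion(nn.Module):
    """Runs both branches stage by stage and fuses pooled features from each stage."""
    def __init__(self, dims=(32, 64, 128), num_classes=4):
        super().__init__()
        v_dims = (3,) + dims  # video input: 3 RGB channels
        s_dims = (3,) + dims  # time series: 3 acceleration axes
        self.v_stages = nn.ModuleList(
            VideoStage(v_dims[i], v_dims[i + 1]) for i in range(len(dims)))
        self.s_stages = nn.ModuleList(
            SeriesStage(s_dims[i], s_dims[i + 1]) for i in range(len(dims)))
        self.head = nn.Linear(2 * sum(dims), num_classes)

    def forward(self, video, series):
        fused = []
        for v_stage, s_stage in zip(self.v_stages, self.s_stages):
            video = v_stage(video)
            series = s_stage(series)
            # Pool each stage's features and keep them for joint classification.
            fused.append(video.mean(dim=(2, 3, 4)))  # (B, C)
            fused.append(series.mean(dim=2))         # (B, C)
        return self.head(torch.cat(fused, dim=1))

# Example: a batch of 2 clips (8 frames, 64x64) with 128-sample acceleration windows,
# classified into 4 classes (crash, tire strike, near-crash, normal driving).
model = DualStagewiseFusion()
logits = model(torch.randn(2, 3, 8, 64, 64), torch.randn(2, 3, 128))
print(logits.shape)  # torch.Size([2, 4])
```

Collecting pooled features after every stage, rather than only at the final layer, is one way to expose both early local cues and late global features from each modality to the classifier, which is the intuition behind the stagewise fusion the abstract describes.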

Related Material


[bibtex]
@InProceedings{Shi_2024_CVPR,
    author    = {Shi, Liang and Chen, Yixin and Liu, Meimei and Guo, Feng},
    title     = {DuST: Dual Swin Transformer for Multi-modal Video and Time-Series Modeling},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {4537-4546}
}