Multi-perspective Traffic Video Description Model with Fine-grained Refinement Approach

Tuan-An To, Minh-Nam Tran, Trong-Bao Ho, Thien-Loc Ha, Quang-Tan Nguyen, Hoang-Chau Luong, Thanh-Duy Cao, Minh-Triet Tran; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7075-7084

Abstract


The analysis of traffic patterns is crucial for enhancing safety and optimizing flow within urban cities. While urban cities possess extensive camera networks for monitoring, the raw video data often lacks the contextual detail necessary for understanding complex traffic incidents and the behaviors of road users. This paper proposes a novel methodology for generating comprehensive descriptions of traffic scenarios by combining a vision-language model (VLM) with rule-based refinements to capture pertinent pedestrian, vehicle, and environmental factors. First, a captioning model generates a general description from the processed video. Subsequently, this description is refined sequentially through three primary modules: pedestrian-aware, vehicle-aware, and context-aware, enhancing the final description. We evaluate our method on the Woven Traffic Safety dataset in Track 2 of the AI City Challenge 2024, obtaining competitive results with an S2 score of 22.6721. Code will be available at https://github.com/ToTuanAn/AICityChallenge2024_Track2
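
The pipeline described above reduces to a captioning step followed by three refinement modules applied in sequence. The following minimal sketch illustrates only this control flow; the function names (generate_caption, pedestrian_aware_refine, vehicle_aware_refine, context_aware_refine) are hypothetical placeholders and not the authors' API, whose actual VLM and rule-based implementations are in the linked repository.

    # Minimal sketch of the sequential refinement pipeline described in the abstract.
    # All function names are illustrative placeholders, not the authors' implementation.

    from typing import List


    def generate_caption(video_frames: List[str]) -> str:
        """Placeholder for the VLM captioning step producing a general description."""
        return "A vehicle approaches a crosswalk while a pedestrian waits."


    def pedestrian_aware_refine(caption: str) -> str:
        """Placeholder rule-based refinement focusing on pedestrian attributes."""
        return caption + " The pedestrian is standing on the right-hand sidewalk."


    def vehicle_aware_refine(caption: str) -> str:
        """Placeholder rule-based refinement focusing on vehicle behavior."""
        return caption + " The vehicle is moving at a moderate speed."


    def context_aware_refine(caption: str) -> str:
        """Placeholder rule-based refinement adding environmental context."""
        return caption + " The scene takes place on a clear day at an urban intersection."


    def describe_traffic_scenario(video_frames: List[str]) -> str:
        """Run the captioner, then apply the three refinement modules in sequence."""
        caption = generate_caption(video_frames)
        for refine in (pedestrian_aware_refine, vehicle_aware_refine, context_aware_refine):
            caption = refine(caption)
        return caption


    if __name__ == "__main__":
        print(describe_traffic_scenario(["frame_000.jpg", "frame_001.jpg"]))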

Related Material


[pdf]
[bibtex]
@InProceedings{To_2024_CVPR,
    author    = {To, Tuan-An and Tran, Minh-Nam and Ho, Trong-Bao and Ha, Thien-Loc and Nguyen, Quang-Tan and Luong, Hoang-Chau and Cao, Thanh-Duy and Tran, Minh-Triet},
    title     = {Multi-perspective Traffic Video Description Model with Fine-grained Refinement Approach},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7075-7084}
}