TrafficInternVL: Understanding Traffic Scenarios with Vision-Language Models

Wu, Hsiu-Fu; Yang, Ya-Ting; Chen, Yung-Ter; Chou, I-Fan

Hsiu-Fu Wu, Ya-Ting Yang, Yung-Ter Chen, I-Fan Chou; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 5288-5295

Abstract

Accurate description and analysis of traffic safety scenarios is essential for traffic safety assessment and the development of AI-driven traffic analysis systems. While recent vision-language models (VLMs) perform well on general benchmarks, they often fail to capture the complex spatial-temporal dynamics and causal reasoning required in safety-critical domains. To address this challenge, we propose TrafficInternVL, a structured fine-tuning framework for open-source VLMs, tailored to traffic safety description and analysis, with a focus on pedestrian-vehicle interactions. Our approach integrates keyframe-based global-local view construction, role-aware multimodal prompt design, representative sample selection, and joint QA-VQA supervision. By treating each question as an independent reasoning session, the model avoids reliance on dialogue history and achieves stable performance across diverse scenarios. Evaluated on the WTS dataset in the 2025 AI City Challenge Track 2, TrafficInternVL achieves the top official score (60.0393). The code will be released soon.

Related Material

[pdf]

[bibtex]

@InProceedings{Wu_2025_ICCV, author = {Wu, Hsiu-Fu and Yang, Ya-Ting and Chen, Yung-Ter and Chou, I-Fan}, title = {TrafficInternVL: Understanding Traffic Scenarios with Vision-Language Models}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops}, month = {October}, year = {2025}, pages = {5288-5295} }