@InProceedings{Kyem_2025_ICCV,
    author    = {Kyem, Blessing Agyei and Owor, Neema J. and Danyo, Andrews and Asamoah, Joshua K. and Denteh, Eugene and Muturi, Tanner and Dontoh, Anthony and Adu-Gyamfi, Yaw and Aboah, Armstrong},
    title     = {Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {5384-5392}
}
Task-Specific Dual-Model Framework for Comprehensive Traffic Safety Video Description and Analysis
Abstract
Traffic safety analysis requires complex video understanding to capture fine-grained behavioral patterns and generate comprehensive descriptions for accident prevention. We present a dual-model framework that combines the complementary strengths of VideoLLaMA and Qwen2.5-VL through task-specific optimization. Our key insight is that training separately on the Captioning and Visual Question Answering (VQA) tasks prevents task interference while maximizing each model's specialized capabilities. VideoLLaMA excels in temporal reasoning (CIDEr: 1.1001), while Qwen2.5-VL demonstrates superior visual understanding (VQA Accuracy: 60.80%). In extensive experiments on the WTS dataset, our method achieves an S2 score of 46.8255 in the 2025 AI City Challenge Track 2, placing 10th on the challenge leaderboard and improving by 1.09 points over single-model baselines. Ablation studies validate that our separate training strategy outperforms joint training by 8.6% in VQA accuracy while maintaining captioning quality.
