TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering

Phimsiri, Sasin; Sunpawatr, Sarut; Cherdchusakulchai, Riu; Kiawjak, Pornprom; Tosawadi, Teepakorn; Tungjitnob, Suchat; Trairattanapa, Visarut; Vatathanavaro, Supawit; Kudisthalert, Wasu; Utintu, Chaitat; Saetan, Worawit; Kongsawat, Nathamon; Borisuitsawat, Phawat; Mahakijdechachai, Kasisdis; Su-Inn, Nitipan; Thamwiwatthana, Ek; Suttichaya, Vasin

Sasin Phimsiri, Sarut Sunpawatr, Riu Cherdchusakulchai, Pornprom Kiawjak, Teepakorn Tosawadi, Suchat Tungjitnob, Visarut Trairattanapa, Supawit Vatathanavaro, Wasu Kudisthalert, Chaitat Utintu, Worawit Saetan, Nathamon Kongsawat, Phawat Borisuitsawat, Kasisdis Mahakijdechachai, Nitipan Su-Inn, Ek Thamwiwatthana, Vasin Suttichaya; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 5299-5306

Abstract

Fine-grained traffic understanding requires both detailed visual descriptions and precise answers to safety-critical questions. We present TrafficInternVL, a framework for fine-grained traffic safety description and question answering, developed for AI City Challenge 2025 Track 2. Our approach is based on the InternVL3-38B vision-language model and integrates four key components: (1) spatially guided visual prompting via bounding-box-based cropping and rendering; (2) adaptive view selection protocols; (3) low-rank adaptation (LoRA) fine-tuning, updating only 1% of model parameters; and (4) caption refinement for intra-scene consistency. Our model achieves a Caption Score of 32.75 (the average of BLEU-4, METEOR, ROUGE-L, and CIDEr) and a VQA accuracy of 83.08%. Code, prompts, and LoRA weights are released at https://github.com/ARV-MLCORE/TrafficInternVL
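
As a rough illustration of component (1), the sketch below shows one way a bounding-box-based crop with surrounding context could be computed. It is not the authors' implementation: the abstract does not state the cropping rule, so the `margin` parameter, the (x, y, width, height) box convention, and the function names `expand_box` and `crop` are all assumptions made for this example. The image is a plain nested list so the snippet runs without any imaging library.

```python
from typing import List, Tuple

# Assumed box convention: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]
# Crop region as (left, top, right, bottom), right/bottom exclusive.
Region = Tuple[int, int, int, int]


def expand_box(box: Box, img_w: int, img_h: int, margin: float = 0.2) -> Region:
    """Grow a box by `margin` of its size on each side, clamped to the image.

    `margin` is a hypothetical parameter; the paper's actual padding is not
    given in the abstract.
    """
    x, y, w, h = box
    pad_x = int(round(w * margin))
    pad_y = int(round(h * margin))
    left = max(0, x - pad_x)
    top = max(0, y - pad_y)
    right = min(img_w, x + w + pad_x)
    bottom = min(img_h, y + h + pad_y)
    return left, top, right, bottom


def crop(image: List[List[int]], region: Region) -> List[List[int]]:
    """Return the sub-image covered by `region` (rows first, then columns)."""
    left, top, right, bottom = region
    return [row[left:right] for row in image[top:bottom]]


# Toy 10x8 "image" whose pixel value encodes its position as row * 10 + col.
image = [[r * 10 + c for c in range(10)] for r in range(8)]
region = expand_box((4, 2, 5, 5), img_w=10, img_h=8, margin=0.2)
patch = crop(image, region)
```

Clamping to the image bounds keeps the crop valid when the annotated object lies near the frame edge, which is common for pedestrians and vehicles entering or leaving a camera view.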

Related Material

[bibtex]

@InProceedings{Phimsiri_2025_ICCV,
    author    = {Phimsiri, Sasin and Sunpawatr, Sarut and Cherdchusakulchai, Riu and Kiawjak, Pornprom and Tosawadi, Teepakorn and Tungjitnob, Suchat and Trairattanapa, Visarut and Vatathanavaro, Supawit and Kudisthalert, Wasu and Utintu, Chaitat and Saetan, Worawit and Kongsawat, Nathamon and Borisuitsawat, Phawat and Mahakijdechachai, Kasisdis and Su-Inn, Nitipan and Thamwiwatthana, Ek and Suttichaya, Vasin},
    title     = {TrafficInternVL: Spatially-Guided Fine-Tuning with Caption Refinement for Fine-Grained Traffic Safety Captioning and Visual Question Answering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {5299-5306}
}