[pdf]
[bibtex]
@InProceedings{Pervaiz_2025_ICCV,
  author    = {Pervaiz, Zaid B. and Cha, Seunghwan and Gulati, Rohan and Jhuria, Monika and Praveen, Varun and Kornuta, Tomasz and Lu, Yao and Murali, Vidya},
  title     = {TrafficVILA: Scaling Vision-Language Models to High-Resolution Video Understanding for Traffic Safety Analysis},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {5366-5373}
}
TrafficVILA: Scaling Vision-Language Models to High-Resolution Video Understanding for Traffic Safety Analysis
Abstract
Traffic scene understanding from surveillance video demands the detection of fine-grained details often overlooked by standard Vision Language Models (VLMs). We introduce TrafficVILA, a system designed for high-resolution video analysis in traffic safety applications. At its core is the NVILA-15B-HRL model, an extension of NVILA that applies dynamic tiling to video inputs, capturing critical details using six tiles per frame with temporal localization. TrafficVILA builds on this model with three key components: (1) video Set-of-Mark prompting using SAM2 for accurate object tracking, (2) an LLM-based fact-checking pipeline that leverages MCQ predictions to reduce hallucinations, and (3) intelligent view and phase selection for multi-perspective datasets. This integrated design enables both fine spatial resolution and robust temporal reasoning. We applied TrafficVILA to the WTS dataset and ranked in the top three of the 2025 AI City Challenge Track 2 leaderboard with a score of 58.85. Ablation studies show that bounding box overlays outperform segmentation masks, and that fact checking significantly improves caption accuracy by mitigating hallucinations.
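The dynamic-tiling idea described above can be illustrated with a minimal sketch: a high-resolution frame is cut into a grid of tiles so that fine details survive the model's input resolution. The 2x3 grid layout, tile sizes, and function names here are assumptions for illustration; the actual NVILA-15B-HRL tiling and resizing policy are defined in the paper.

```python
import numpy as np

def tile_frame(frame: np.ndarray, rows: int = 2, cols: int = 3) -> list:
    """Split a frame into a rows x cols grid of tiles.

    Illustrative sketch of dynamic tiling (six tiles per frame);
    the real pipeline's layout and any global thumbnail are assumptions.
    """
    h, w = frame.shape[:2]
    th, tw = h // rows, w // cols  # per-tile height and width
    tiles = []
    for r in range(rows):
        for c in range(cols):
            # Crop the (r, c) tile from the full-resolution frame.
            tiles.append(frame[r * th:(r + 1) * th, c * tw:(c + 1) * tw])
    return tiles

# Example: a 1080x1920 surveillance frame yields six 540x640 tiles.
frame = np.zeros((1080, 1920, 3), dtype=np.uint8)
tiles = tile_frame(frame)
```

Each tile can then be encoded at the VLM's native input resolution, trading extra tokens for spatial detail.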
