ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding

Liang Shi, Boyu Jiang, Tong Zeng, Feng Guo; Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, 2025, pp. 1061-1071

Abstract


Accurately identifying, understanding, and describing traffic safety-critical events (SCEs), including crashes, tire strikes, and near-crashes, is crucial for advanced driver assistance systems, automated driving systems, and traffic safety. As SCEs are rare events, most general vision-language models (VLMs) have not been trained sufficiently to link SCE videos and narratives, which could lead to hallucinations and missing key safety characteristics. Here we introduce ScVLM, a novel hybrid methodology that integrates supervised and contrastive learning techniques to classify the severity and types of SCEs as well as to generate narrative descriptions of SCEs. This approach utilizes classification to enhance VLMs' comprehension of driving videos and improve the rationality of event descriptions. The proposed approach is trained on and evaluated with more than 8,600 SCEs from the Second Strategic Highway Research Program Naturalistic Driving Study dataset, the largest publicly accessible driving dataset with videos and SCE annotations. The results demonstrate the superiority of the proposed approach in generating contextually accurate event descriptions and mitigating VLM hallucinations. The code will be available at https://github.com/datadrivenwheels/ScVLM.
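
The hybrid objective sketched in the abstract, supervised classification of event severity and type combined with contrastive video-text alignment, can be illustrated with a minimal example. The function name, module dimensions, and the loss weight alpha below are illustrative assumptions for this sketch, not the authors' implementation.

```python
# Illustrative sketch only: a hybrid objective combining supervised severity/type
# classification with contrastive video-text alignment, as described at a high
# level in the abstract. All names, dimensions, and the weight `alpha` are
# assumptions, not the paper's implementation.
import torch
import torch.nn.functional as F


def hybrid_loss(video_emb, text_emb, severity_logits, severity_labels,
                temperature=0.07, alpha=0.5):
    """Combine a supervised classification loss with an InfoNCE-style
    contrastive loss over matched video/narrative embedding pairs."""
    # Supervised branch: cross-entropy over event severity (or type) classes.
    ce = F.cross_entropy(severity_logits, severity_labels)

    # Contrastive branch: align each video clip with its own narrative.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                    # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    nce = 0.5 * (F.cross_entropy(logits, targets) +
                 F.cross_entropy(logits.T, targets))

    return alpha * ce + (1.0 - alpha) * nce


# Toy usage with random tensors standing in for encoder outputs.
B, D, C = 8, 256, 3                                   # batch, embed dim, classes
loss = hybrid_loss(torch.randn(B, D), torch.randn(B, D),
                   torch.randn(B, C), torch.randint(0, C, (B,)))
print(loss.item())
```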

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Shi_2025_WACV,
    author    = {Shi, Liang and Jiang, Boyu and Zeng, Tong and Guo, Feng},
    title     = {ScVLM: Enhancing Vision-Language Model for Safety-Critical Event Understanding},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {1061-1071}
}