@InProceedings{Rivera_2025_WACV,
  author    = {Rivera, Esteban and L\"ubberstedt, Jannik and Uhlemann, Nico and Lienkamp, Markus},
  title     = {Scenario Understanding of Traffic Scenes Through Large Visual Language Models},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {February},
  year      = {2025},
  pages     = {1037-1045}
}
Scenario Understanding of Traffic Scenes Through Large Visual Language Models
Abstract
Deep learning models for autonomous driving, encompassing perception, planning, and control, depend on vast datasets to achieve their high performance. However, their generalization often suffers due to domain-specific data distributions, making an effective scene-based categorization of samples necessary to improve their reliability across diverse domains. Manual captioning, though valuable, is both labor-intensive and time-consuming, creating a bottleneck in the data annotation process. Large Visual Language Models (LVLMs) present a compelling solution by automating image analysis and categorization through contextual queries, often without requiring retraining for new categories. In this study, we evaluate the capabilities of LVLMs, including GPT-4 and LLaVA, to understand and classify urban traffic scenes on both an in-house dataset and the BDD100K. We propose a scalable captioning pipeline that integrates state-of-the-art models, enabling flexible deployment on new datasets. Our analysis, combining quantitative metrics with qualitative insights, demonstrates the effectiveness of LVLMs in understanding urban traffic scenarios and highlights their potential as an efficient tool for data-driven advancements in autonomous driving.