Evaluating Multimodal Vision-Language Model Prompting Strategies for Visual Question Answering in Road Scene Understanding

Aryan Keskar, Srinivasa Perisetla, Ross Greer; Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops, 2025, pp. 1027-1036

Abstract


Understanding complex traffic scenes is a crucial challenge in advancing autonomous driving systems. Visual Question Answering (VQA) tasks have emerged as a promising approach to extracting actionable insights from multimodal traffic data, enabling vehicles to make accurate real-time decisions. The MAPLM-QA dataset, introduced as part of the 2025 WACV Large Language Vision Models Challenge for Autonomous Driving (LLVM-AD), offers a robust benchmark for this task, comprising 14,000 multimodal frames that combine high-resolution panoramic images with rendered Bird's Eye View (BEV) depictions of LiDAR 3D point clouds. In this work, we explore the application of NVIDIA's Vision-Language Model (ViLA) to VQA in MAPLM-QA. By employing detailed prompt engineering tailored to the dataset, we systematically evaluate ViLA's performance, identifying strengths in certain metrics such as quality assessment while highlighting challenges in lane counting, intersection recognition, and nuanced scene understanding. Our findings illustrate the potential of Vision-Language Models (VLMs) in enhancing traffic scene analysis and autonomous driving, establishing a strong foundation and an analysis of limitations for future research in leveraging VLMs and multimodal datasets toward scalable, robust traffic scene understanding.
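To make the prompting setup concrete, the sketch below shows one plausible way to pair a dataset-tailored instruction with a multimodal frame (panoramic image plus BEV LiDAR rendering) and a question. It is illustrative only: the exact prompts used in the paper are not reproduced here, and `query_vlm`, `build_prompt`, and the frame field names are hypothetical placeholders rather than the VILA/ViLA API.

```python
# Illustrative sketch only. `query_vlm(images, prompt) -> str` stands in for
# whatever model interface is actually used; field names are hypothetical.

def build_prompt(question: str) -> str:
    """Compose a detailed, dataset-tailored instruction for a road-scene VQA query."""
    context = (
        "You are analyzing a road scene given a high-resolution panoramic image "
        "and a rendered Bird's Eye View (BEV) of a LiDAR 3D point cloud. "
        "Answer concisely, and choose only from the listed options when options are given."
    )
    return f"{context}\n\nQuestion: {question}\nAnswer:"

def answer_frame(frame: dict, query_vlm) -> dict:
    """Run one MAPLM-QA-style frame through a generic VLM callable."""
    images = [frame["panoramic_image"], frame["bev_lidar_render"]]
    return {q: query_vlm(images, build_prompt(q)) for q in frame["questions"]}

# Example usage with a dummy backend (replace with a real VLM call):
if __name__ == "__main__":
    dummy_frame = {
        "panoramic_image": "frame_0001_pano.jpg",
        "bev_lidar_render": "frame_0001_bev.png",
        "questions": ["How many lanes are visible?", "Is there an intersection ahead?"],
    }
    fake_vlm = lambda images, prompt: "(model answer)"
    print(answer_frame(dummy_frame, fake_vlm))
```

The same pattern extends to the question categories discussed in the paper (quality assessment, lane counting, intersection recognition) by varying only the question text while keeping the shared scene context.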

Related Material


[bibtex]
@InProceedings{Keskar_2025_WACV,
    author    = {Keskar, Aryan and Perisetla, Srinivasa and Greer, Ross},
    title     = {Evaluating Multimodal Vision-Language Model Prompting Strategies for Visual Question Answering in Road Scene Understanding},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {1027-1036}
}