@InProceedings{Shaw_2025_ICCV,
    author    = {Shaw, Ankit Kumar and Sah, Chandan Kumar and Lian, Xiaoli and Baig, Arsalan Shahid and Wen, Tuopu and Jiang, Kun and Yang, Mengmeng and Yang, Diange and Zhang, Li},
    title     = {SafeRoute: Enhancing Traffic Scene Understanding via a Unified Deep Learning and Multimodal LLM},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4547-4556}
}
SafeRoute: Enhancing Traffic Scene Understanding via a Unified Deep Learning and Multimodal LLM
Abstract
Autonomous vehicles (AVs) require highly reliable traffic sign recognition and robust lane detection to navigate safely in complex, dynamic environments. This paper presents SafeRoute, a unified perception framework that integrates deep learning with an instruction-tuned Multimodal Large Language Model (MLLM) for comprehensive road scene understanding. For traffic sign recognition, we benchmark three state-of-the-art architectures, ResNet-50, YOLOv8, and RT-DETR, which achieve accuracies of 99.8%, 98.0%, and 96.6%, respectively. To address the limitations of traditional vision-only lane detection under adverse conditions (e.g., occlusion, poor lighting, and road wear), we introduce an MLLM-based pipeline that is fine-tuned via instruction learning without requiring large-scale pretraining. The pipeline introduces a novel Multimodal Adapter that fuses CNN-derived spatial features with EVA-CLIP embeddings, enabling fine-grained visual grounding and robustness to occlusion. By feeding these fused visual tokens into a LLaMA-2 decoder, the system supports semantic-level reasoning and interpretable scene understanding, moving beyond segmentation to structured, language-based lane perception. Quantitatively, SafeRoute achieves a Frame Overall Accuracy (FRM) of 53.87% and a Question Overall Accuracy (QNS) of 82.83%, with lane detection accuracies of 99.6% in clear conditions and 93.0% at night. It also reasons robustly in adverse conditions, reaching 88.4% accuracy in rain and 95.6% under lane degradation. Overall, SafeRoute offers a unified, multimodal approach to AV perception that significantly improves both the robustness and the explainability of lane detection in safety-critical scenarios.
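The gap between FRM (53.87%) and QNS (82.83%) follows from how the two metrics aggregate answers. The sketch below assumes the common definitions for frame-based visual QA benchmarks: QNS is the fraction of all questions answered correctly, while FRM counts a frame as correct only when every question asked about that frame is answered correctly, so FRM can never exceed QNS. The function name `frame_and_question_accuracy` and the `(frame_id, is_correct)` input format are hypothetical illustrations, not the authors' evaluation code.

```python
from collections import defaultdict


def frame_and_question_accuracy(records):
    """Compute (FRM, QNS) as fractions in [0, 1].

    Hypothetical input: an iterable of (frame_id, is_correct) pairs,
    one pair per question. Assumed definitions:
      QNS = correctly answered questions / all questions
      FRM = frames with every question correct / all frames
    """
    # Group per-question correctness flags by the frame they belong to.
    per_frame = defaultdict(list)
    for frame_id, is_correct in records:
        per_frame[frame_id].append(bool(is_correct))

    n_questions = sum(len(flags) for flags in per_frame.values())
    if n_questions == 0:
        raise ValueError("no questions to score")

    # Question-level accuracy: every question counts equally.
    qns = sum(sum(flags) for flags in per_frame.values()) / n_questions
    # Frame-level accuracy: a single wrong answer fails the whole frame.
    frm = sum(all(flags) for flags in per_frame.values()) / len(per_frame)
    return frm, qns


if __name__ == "__main__":
    # Toy data: 4 frames, 8 questions, 6 answered correctly.
    demo = [
        ("A", True), ("A", True),
        ("B", True), ("B", False),
        ("C", True), ("C", True), ("C", True),
        ("D", False),
    ]
    frm, qns = frame_and_question_accuracy(demo)
    print(f"FRM = {frm:.2%}, QNS = {qns:.2%}")  # FRM = 50.00%, QNS = 75.00%
```

In the toy data, 6 of 8 questions are correct (QNS 75%), but only frames A and C are fully correct (FRM 50%), which mirrors why a model can score well per question yet much lower per frame.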