SafeRoute: Enhancing Traffic Scene Understanding via a Unified Deep Learning and Multimodal LLM

Ankit Kumar Shaw, Chandan Kumar Sah, Xiaoli Lian, Arsalan Shahid Baig, Tuopu Wen, Kun Jiang, Mengmeng Yang, Diange Yang, Li Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4606-4615

Abstract

Autonomous vehicles (AVs) require highly reliable traffic sign recognition and robust lane detection to navigate safely in complex and dynamic environments. This paper presents SafeRoute, a unified perception framework that integrates deep learning with an instruction-tuned Multimodal Large Language Model (MLLM) for comprehensive road scene understanding. For traffic sign recognition, we benchmark three state-of-the-art architectures, ResNet-50, YOLOv8, and RT-DETR, which achieve accuracies of 99.8%, 98.0%, and 96.6%, respectively. To address the limitations of traditional vision-only methods in lane detection under adverse conditions (e.g., occlusion, poor lighting, road wear), we introduce an MLLM-based pipeline that is fine-tuned via instruction learning and does not require large-scale pretraining. Our approach introduces a novel Multimodal Adapter that fuses CNN-derived spatial features with EVA-CLIP embeddings, enabling fine-grained visual grounding and robustness to occlusion. By integrating these visual tokens into a LLaMA-2 decoder, our system performs semantic-level reasoning and interpretable scene understanding, moving beyond segmentation to structured, language-based lane perception. Quantitatively, SafeRoute achieves a Frame Overall Accuracy (FRM) of 53.87%, a Question Overall Accuracy (QNS) of 82.83%, and lane detection accuracies of 99.6% in clear conditions and 93.0% at night. It also demonstrates robust reasoning in adverse conditions, with 88.4% accuracy in rain and 95.6% under lane degradation. Overall, SafeRoute introduces a new paradigm in AV perception by offering a unified, multimodal approach that significantly improves both the robustness and the explainability of lane detection in safety-critical scenarios.
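The abstract reports FRM and QNS without defining them. The sketch below shows one common reading of these two metrics, under the assumption that QNS is the fraction of individual questions answered correctly and FRM is the fraction of frames for which every question is answered correctly. This reading is an assumption, not a definition taken from the paper, and the function name and record layout are hypothetical. Under it, FRM can sit well below QNS, since a single wrong answer invalidates the whole frame.

```python
from collections import defaultdict


def frame_and_question_accuracy(records):
    """Return (frm, qns) for an iterable of (frame_id, is_correct) pairs.

    Assumed definitions (not taken from the paper):
      qns = correct questions / all questions
      frm = frames with every question correct / all frames
    """
    per_frame = defaultdict(list)
    for frame_id, is_correct in records:
        per_frame[frame_id].append(bool(is_correct))
    if not per_frame:
        raise ValueError("records must contain at least one question")

    total_questions = sum(len(answers) for answers in per_frame.values())
    correct_questions = sum(sum(answers) for answers in per_frame.values())
    qns = correct_questions / total_questions

    # A frame counts only if all of its questions were answered correctly.
    fully_correct_frames = sum(all(answers) for answers in per_frame.values())
    frm = fully_correct_frames / len(per_frame)
    return frm, qns


# Toy data: two frames with four questions each, and one wrong answer in f2.
records = [
    ("f1", True), ("f1", True), ("f1", True), ("f1", True),
    ("f2", True), ("f2", True), ("f2", True), ("f2", False),
]
frm, qns = frame_and_question_accuracy(records)
print(f"FRM = {frm:.3f}, QNS = {qns:.3f}")
```

In the toy data, seven of eight questions are correct but only one of two frames is fully correct, which illustrates how a gap between the two scores can arise.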

Related Material


@InProceedings{Shaw_2025_ICCV,
    author    = {Shaw, Ankit Kumar and Sah, Chandan Kumar and Lian, Xiaoli and Baig, Arsalan Shahid and Wen, Tuopu and Jiang, Kun and Yang, Mengmeng and Yang, Diange and Zhang, Li},
    title     = {SafeRoute: Enhancing Traffic Scene Understanding via a Unified Deep Learning and Multimodal LLM},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4606-4615}
}