ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation
Abstract
End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem remains open: few VLM-based E2E methods perform well in closed-loop evaluation, owing to the gap between the semantic reasoning space and the purely numerical trajectory output of the action space. To tackle this issue, we propose ORION, a holistic E2E autonomous driving framework by vision-language instructed action generation. ORION uniquely combines a QT-Former to aggregate long-term history context, a Large Language Model (LLM) for driving-scenario reasoning, and a generative planner for precise trajectory prediction. ORION further aligns the reasoning space and the action space to implement unified E2E optimization for both visual question-answering (VQA) and planning tasks. Our method achieves an impressive closed-loop performance of 77.47 Driving Score (DS) and 54.62% Success Rate (SR) on the challenging Bench2Drive benchmark, outperforming state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 28.08% SR.
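The abstract describes a three-stage pipeline: a QT-Former that aggregates long-term history context, an LLM that reasons over the driving scenario, and a generative planner that emits trajectory waypoints. Below is a minimal, hypothetical PyTorch sketch of how such a pipeline could be composed; all class names, interfaces, and dimensions (QTFormer, GenerativePlanner, OrionSketch, the pooled planning token, the MLP planner head) are illustrative assumptions, not the authors' released implementation.

# Hypothetical sketch of the three-stage pipeline described in the abstract;
# module names, shapes, and interfaces are illustrative assumptions only.
import torch
import torch.nn as nn


class QTFormer(nn.Module):
    """Query-based transformer that condenses long-term history features
    into a fixed set of scene tokens (assumed interface)."""

    def __init__(self, num_queries=32, dim=256):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)

    def forward(self, history_feats):  # (B, T, dim) visual features over time
        q = self.queries.unsqueeze(0).expand(history_feats.size(0), -1, -1)
        return self.decoder(q, history_feats)  # (B, num_queries, dim)


class GenerativePlanner(nn.Module):
    """Maps a planning token from the reasoning space to waypoints
    (a plain MLP stand-in for the paper's generative model)."""

    def __init__(self, dim=256, horizon=6):
        super().__init__()
        self.horizon = horizon
        self.head = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, horizon * 2))

    def forward(self, plan_token):  # (B, dim)
        return self.head(plan_token).view(-1, self.horizon, 2)  # (B, H, 2)


class OrionSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.qt_former = QTFormer(dim=dim)
        # Stand-in for the LLM: any causal transformer over scene tokens.
        llm_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.llm = nn.TransformerEncoder(llm_layer, num_layers=2)
        self.planner = GenerativePlanner(dim=dim)

    def forward(self, history_feats):
        scene_tokens = self.qt_former(history_feats)  # long-term context
        reasoned = self.llm(scene_tokens)             # scenario reasoning
        plan_token = reasoned.mean(dim=1)             # pooled planning token
        return self.planner(plan_token)               # trajectory waypoints


traj = OrionSketch()(torch.randn(2, 16, 256))
print(traj.shape)  # torch.Size([2, 6, 2])

Because the same token stream feeds both the reasoning (VQA) and planning heads, a single E2E loss over both tasks is what lets the framework align the two spaces, which is the property the abstract highlights.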
Related Material

[pdf] [supp] [arXiv]

[bibtex]
@InProceedings{Fu_2025_ICCV,
    author    = {Fu, Haoyu and Zhang, Diankun and Zhao, Zongchuang and Cui, Jianfeng and Liang, Dingkang and Zhang, Chong and Zhang, Dingyuan and Xie, Hongwei and Wang, Bing and Bai, Xiang},
    title     = {ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {24823-24834}
}