Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models

Xinpeng Ding, Jianhua Han, Hang Xu, Xiaodan Liang, Wei Zhang, Xiaomeng Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13668-13677

Abstract


The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on limited tasks and often omits key multi-view and temporal information, which is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method that generates instruction-response pairs automatically, inspired by the logical progression of human driving. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features, language-aligned for large language models. BEV-InMLLM integrates multi-view spatial awareness and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., a 9% improvement on various tasks. We release NuInstruct at https://github.com/xmed-lab/NuInstruct.
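
The abstract describes two mechanisms without implementation detail. As a rough illustration of how an SQL-based pipeline can turn structured scene annotations into instruction-response pairs, here is a minimal, self-contained Python sketch; the `objects` table, its schema, and the question template are hypothetical stand-ins, not the schema actually used to build NuInstruct:

```python
import sqlite3

# Minimal sketch (our assumption, not the released pipeline): store per-frame
# annotations in a relational table, run one SQL query per subtask template,
# and render each result row into an instruction-response pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE objects (frame_id TEXT, category TEXT, camera TEXT)")
conn.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [("f001", "pedestrian", "CAM_FRONT"),
     ("f001", "pedestrian", "CAM_FRONT"),
     ("f001", "car", "CAM_BACK")],
)

query = """
    SELECT frame_id, COUNT(*) AS n
    FROM objects
    WHERE category = 'pedestrian' AND camera = 'CAM_FRONT'
    GROUP BY frame_id
"""
for frame_id, n in conn.execute(query):
    instruction = "How many pedestrians are visible in the front-camera view?"
    response = f"There are {n} pedestrians in the front view."
    print(frame_id, instruction, "->", response)
```

Likewise, a plug-and-play BEV injection step is commonly realized as cross-attention from language-aligned query tokens to flattened BEV features. The following minimal PyTorch sketch assumes that design; the class name `BEVInjection`, the dimensions, and the residual connection are our assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

class BEVInjection(nn.Module):
    """Hypothetical plug-and-play BEV injection block: the MLLM's query
    tokens attend to flattened BEV features via cross-attention, and the
    result is added residually so the block can be dropped into an
    existing MLLM without altering its backbone."""

    def __init__(self, token_dim=768, bev_dim=256, num_heads=8):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, token_dim)  # align BEV channels with token width
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens, bev):
        # tokens: (B, N, token_dim) instruction-aware query tokens
        # bev:    (B, C, H, W) bird's-eye-view feature map
        bev_tokens = self.bev_proj(bev.flatten(2).transpose(1, 2))  # (B, H*W, token_dim)
        injected, _ = self.cross_attn(tokens, bev_tokens, bev_tokens)
        return self.norm(tokens + injected)  # residual keeps original token semantics

module = BEVInjection()
out = module(torch.randn(2, 32, 768), torch.randn(2, 256, 50, 50))  # -> (2, 32, 768)
```

The residual formulation in this sketch is what makes such a module "plug-and-play" in spirit: when the injected contribution is small, the original MLLM token stream passes through largely unchanged.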

Related Material


BibTeX:
@InProceedings{Ding_2024_CVPR,
    author    = {Ding, Xinpeng and Han, Jianhua and Xu, Hang and Liang, Xiaodan and Zhang, Wei and Li, Xiaomeng},
    title     = {Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13668-13677}
}