Talk2Traffic: Interactive and Editable Traffic Scenario Generation for Autonomous Driving with Multimodal Large Language Model
Zihao Sheng, Zilin Huang, Yansong Qu, Yue Leng, Sikai Chen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 3797-3806
Abstract
Deploying autonomous vehicles (AVs) requires testing in diverse and challenging scenarios to ensure safety and reliability, yet collecting real-world data remains prohibitively expensive. While simulation-based approaches offer cost-effective alternatives, most existing methods lack sufficient support for intuitive, interactive editing of generated scenarios. This paper presents Talk2Traffic, a novel framework that leverages multimodal large language models (MLLMs) to enable interactive and editable traffic scenario generation. Talk2Traffic allows human users to generate diverse traffic scenarios through multimodal inputs (text, speech, and sketches). Our approach first employs an MLLM-based interpreter to extract structured representations from these inputs. These representations are then translated into executable Scenic code using a retrieval-augmented generation mechanism to reduce hallucinations and ensure syntactic correctness. Furthermore, a human feedback guidance module enables iterative refinement and editing of scenarios through natural language instructions. Experiments demonstrate that Talk2Traffic outperforms state-of-the-art methods in generating challenging scenarios. Qualitative evaluations further illustrate that the framework can handle diverse input modalities and supports scenario editing.
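To make the described pipeline concrete, the sketch below walks through the three stages from the abstract (interpretation, retrieval-augmented code generation, and feedback-guided refinement) in plain Python. Every component is a toy stand-in rather than the paper's implementation: keyword matching replaces the MLLM interpreter, a dictionary lookup replaces the retrieval mechanism, and the embedded Scenic snippets (behaviors such as FollowLaneBehavior and LaneChangeBehavior, which exist in Scenic's driving library) are illustrative assumptions, not the authors' prompts or exemplar corpus.

from dataclasses import dataclass

@dataclass
class ScenarioSpec:
    """Toy structured representation extracted from a user instruction."""
    maneuver: str            # e.g. "cut-in"
    weather: str = "clear"   # e.g. "clear", "rain"

# Toy retrieval corpus mapping maneuvers to Scenic exemplar snippets.
# The snippets are illustrative only; a real corpus would hold full,
# validated Scenic programs.
EXEMPLARS = {
    "cut-in": ("adv = new Car left of ego by 3.5,\n"
               "    with behavior LaneChangeBehavior()"),
    "hard brake": ("adv = new Car ahead of ego by 15,\n"
                   "    with behavior FollowLaneBehavior(target_speed=0)"),
}

def interpret(instruction: str) -> ScenarioSpec:
    # Stand-in for the MLLM interpreter: keyword matching instead of an
    # LLM call over text, speech, or sketch inputs.
    text = instruction.lower()
    maneuver = next((m for m in EXEMPLARS
                     if m in text or m.replace("-", " ") in text), "cut-in")
    return ScenarioSpec(maneuver=maneuver,
                        weather="rain" if "rain" in text else "clear")

def generate_scenic(spec: ScenarioSpec) -> str:
    # Stand-in for retrieval-augmented generation: assemble a Scenic draft
    # around the retrieved exemplar (exact-key lookup instead of retrieval).
    header = ("param map = localPath('Town05.xodr')\n"
              "model scenic.simulators.carla.model\n"
              f"# weather: {spec.weather}\n"
              "ego = new Car with behavior FollowLaneBehavior()\n")
    return header + EXEMPLARS[spec.maneuver]

def refine(code: str, feedback: str) -> str:
    # Stand-in for the human feedback guidance module: a single string
    # patch where the real system would re-prompt the MLLM with the
    # natural language edit instruction.
    if "rain" in feedback.lower():
        code = code.replace("# weather: clear", "# weather: rain")
    return code

draft = generate_scenic(interpret("generate a cut-in scenario ahead of the ego vehicle"))
print(refine(draft, "now make it rain"))

Running the sketch prints a small Scenic-style draft for a cut-in scenario and then applies the "make it rain" edit, mirroring the generate-then-refine loop the abstract describes.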
Related Material

[pdf]  [bibtex]
@InProceedings{Sheng_2025_CVPR,
    author    = {Sheng, Zihao and Huang, Zilin and Qu, Yansong and Leng, Yue and Chen, Sikai},
    title     = {Talk2Traffic: Interactive and Editable Traffic Scenario Generation for Autonomous Driving with Multimodal Large Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3797-3806}
}