VLA4CoDrive: Vision-Language-Action Dataset for Cooperative Autonomous Driving

Boroujeni, Sayed Pedram Haeri; Razi, Abolfazl

Sayed Pedram Haeri Boroujeni, Abolfazl Razi; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2026, pp. 1789-1799

Abstract

Vision-Language-Action (VLA) models are emerging as a promising direction for autonomous driving, yet progress is limited by the lack of comprehensive datasets that jointly support driving actions, rich textual grounding, and cooperative multi-vehicle perception under diverse conditions. We introduce VLA4CoDrive, a large-scale dataset in CARLA that provides synchronized multi-view, multi-agent driving data designed for both cooperative autonomous driving and standard single-vehicle learning. Unlike prior driving datasets that focus on single-agent sensing or non-synchronized multi-view recordings, VLA4CoDrive captures observations from two to three cooperating vehicles (with a scalable design extensible to more agents), enabling joint perception, shared situational understanding, and coordinated action capabilities not supported by any existing public dataset. Vehicles are recorded while cooperatively navigating shared routes, ensuring consistent spatiotemporal alignment and preserving cooperative dynamics rather than independent parallel recordings. To ensure broad coverage of real-world driving scenarios, we collect data across eight towns spanning dense urban, suburban, highway, and rural environments, systematically recorded under eight weather conditions with frame-level alignment preserved across variations. Each vehicle is equipped with four RGB cameras and complementary sensing modalities, providing rich multi-view and multi-modal perception. The dataset comprises 10 million vision samples, 150,000 language annotations, and 1 million action records, corresponding to approximately 300 hours of autonomous driving. Action supervision is provided through precise trajectory data, while language annotations are uniquely structured to include contextual captions, summaries, descriptions, and reasoning. 
By unifying synchronized multi-agent perception, rich language grounding, and action-level supervision at unprecedented scale, VLA4CoDrive establishes a foundation for cooperative VLA research, opening avenues for multi-agent reasoning, shared situational awareness, and next-generation autonomous driving systems. The dataset is publicly available at https://github.com/SayedPedramHaeri/VLA4CoDrive.
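
As a rough reading aid, the headline figures in the abstract can be turned into a few ratios. The sketch below is not taken from the paper: it assumes that the approximately 300 hours is the total driving time covered by all three counts, and the function name `dataset_ratios` and its dictionary keys are hypothetical. The resulting rates are back-of-the-envelope estimates, not reported sampling frequencies.

```python
# Back-of-the-envelope ratios from the figures quoted in the abstract.
# Assumption (not stated in the source): the ~300 hours is the total
# driving time that all three counts refer to.
VISION_SAMPLES = 10_000_000
LANGUAGE_ANNOTATIONS = 150_000
ACTION_RECORDS = 1_000_000
DRIVING_HOURS = 300


def dataset_ratios(vision, language, actions, hours):
    """Return simple ratios between the quoted dataset counts."""
    seconds = hours * 3600
    return {
        "vision_per_action": vision / actions,
        "actions_per_language": actions / language,
        "actions_per_second": actions / seconds,
        "vision_per_second": vision / seconds,
    }


ratios = dataset_ratios(
    VISION_SAMPLES, LANGUAGE_ANNOTATIONS, ACTION_RECORDS, DRIVING_HOURS
)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
```

Under that assumption, the counts work out to ten vision samples per action record and a little under one action record per second of driving.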

Related Material

[bibtex]

@InProceedings{Boroujeni_2026_WACV,
    author    = {Boroujeni, Sayed Pedram Haeri and Razi, Abolfazl},
    title     = {VLA4CoDrive: Vision-Language-Action Dataset for Cooperative Autonomous Driving},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {March},
    year      = {2026},
    pages     = {1789-1799}
}