EventGPT: Event Stream Understanding with Multimodal Large Language Models
Abstract
Event cameras capture visual information as asynchronous streams of pixel changes, excelling under challenging lighting and in highly dynamic scenarios. Existing multimodal large language models (MLLMs) concentrate on natural RGB images and fail in scenarios where event data is a better fit. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, pioneering the integration of large language models (LLMs) with event-based vision. To bridge the large domain gap, we propose a three-stage optimization paradigm that progressively equips a pre-trained LLM with event understanding. EventGPT consists of an event encoder, a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. First, following LLaVA, we warm up the linear projector on GPT-generated RGB image-text pairs, since the gap between natural images and language is smaller. Second, we construct N-ImageNet-Chat, a large synthetic dataset of event data paired with texts, to enable the spatio-temporal aggregator and to train the event-language adapter, aligning event features more closely with the language space. Finally, we gather Event-Chat, an instruction dataset containing extensive real-world data, to fine-tune the entire model and further enhance its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.
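To make the component chain in the abstract concrete, below is a minimal PyTorch-style sketch of the event-to-language pipeline (event encoder, spatio-temporal aggregator, linear projector, event-language adapter, feeding an LLM). It is not the authors' implementation: the module choices, the 5-channel event-frame representation, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class EventGPTSketch(nn.Module):
    # Hypothetical sketch of the pipeline described in the abstract;
    # module names and sizes are assumptions, not the released code.
    def __init__(self, event_dim=1024, llm_dim=4096):
        super().__init__()
        # Event encoder: turns stacked event frames into patch features.
        self.event_encoder = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, event_dim, kernel_size=3, stride=2, padding=1))
        # Spatio-temporal aggregator: fuses patch features across time bins
        # (a single transformer encoder layer as a stand-in).
        self.aggregator = nn.TransformerEncoderLayer(
            d_model=event_dim, nhead=8, batch_first=True)
        # Linear projector: warmed up on RGB image-text pairs (stage 1).
        self.projector = nn.Linear(event_dim, llm_dim)
        # Event-language adapter: aligns projected features with the
        # language space (stage 2, N-ImageNet-Chat).
        self.adapter = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, event_frames):
        # event_frames: (B, T, 5, H, W), one 5-channel frame per time bin.
        B, T, C, H, W = event_frames.shape
        x = self.event_encoder(event_frames.flatten(0, 1))  # (B*T, D, h, w)
        x = x.flatten(2).transpose(1, 2)                     # (B*T, h*w, D)
        x = x.reshape(B, T * x.shape[1], -1)                 # (B, N, D)
        x = self.aggregator(x)                               # spatio-temporal fusion
        tokens = self.adapter(self.projector(x))             # (B, N, llm_dim)
        # These tokens would be prepended to the LLM's text embeddings
        # (stage 3: end-to-end fine-tuning on Event-Chat).
        return tokens

Under these assumptions, an event window split into T time bins and rendered as 5-channel frames would yield a token sequence that the LLM attends to alongside the question text.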
Related Material

[pdf]
[arXiv]
[bibtex]
@InProceedings{Liu_2025_CVPR,
    author    = {Liu, Shaoyu and Li, Jianing and Zhao, Guanghui and Zhang, Yunjian and Meng, Xin and Yu, Fei Richard and Ji, Xiangyang and Li, Ming},
    title     = {EventGPT: Event Stream Understanding with Multimodal Large Language Models},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29139-29149}
}