EventGPT: Event Stream Understanding with Multimodal Large Language Models
Abstract
Event cameras capture visual information as asynchronous streams of pixel changes, excelling under challenging lighting and in highly dynamic scenarios. Existing multimodal large language models (MLLMs) concentrate on natural RGB images and fail in scenarios where event data is a better fit. In this paper, we introduce EventGPT, the first MLLM for event stream understanding, pioneering the integration of large language models (LLMs) with event-based vision. To bridge the large domain gap, we propose a three-stage optimization paradigm that progressively equips a pre-trained LLM with event understanding. EventGPT consists of an event encoder, a spatio-temporal aggregator, a linear projector, an event-language adapter, and an LLM. First, following LLaVA, we warm up the linear projector on GPT-generated RGB image-text pairs, since the gap between natural images and language is smaller. Second, we construct N-ImageNet-Chat, a large synthetic dataset of event data paired with texts, to enable the spatio-temporal aggregator and to train the event-language adapter, aligning event features more closely with the language space. Finally, we gather Event-Chat, an instruction dataset containing extensive real-world data, to fine-tune the entire model and further enhance its generalization ability. We construct a comprehensive benchmark, and experiments show that EventGPT surpasses previous state-of-the-art MLLMs in generation quality, descriptive accuracy, and reasoning capability.
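To make the component chain in the abstract concrete, below is a minimal PyTorch-style sketch of the event-to-language pipeline (event encoder, spatio-temporal aggregator, linear projector, event-language adapter, feeding an LLM). It is not the authors' implementation: the module choices, the 5-channel event-frame representation, and all dimensions are illustrative assumptions.

import torch
import torch.nn as nn

class EventGPTSketch(nn.Module):
    # Hypothetical sketch of the pipeline described in the abstract;
    # module names and sizes are assumptions, not the released code.
    def __init__(self, event_dim=1024, llm_dim=4096):
        super().__init__()
        # Event encoder: turns stacked event frames into patch features.
        self.event_encoder = nn.Sequential(
            nn.Conv2d(5, 64, kernel_size=3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, event_dim, kernel_size=3, stride=2, padding=1))
        # Spatio-temporal aggregator: fuses patch features across time bins
        # (a single transformer encoder layer as a stand-in).
        self.aggregator = nn.TransformerEncoderLayer(
            d_model=event_dim, nhead=8, batch_first=True)
        # Linear projector: warmed up on RGB image-text pairs (stage 1).
        self.projector = nn.Linear(event_dim, llm_dim)
        # Event-language adapter: aligns projected features with the
        # language space (stage 2, N-ImageNet-Chat).
        self.adapter = nn.Sequential(
            nn.Linear(llm_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, event_frames):
        # event_frames: (B, T, 5, H, W), one 5-channel frame per time bin.
        B, T, C, H, W = event_frames.shape
        x = self.event_encoder(event_frames.flatten(0, 1))  # (B*T, D, h, w)
        x = x.flatten(2).transpose(1, 2)                     # (B*T, h*w, D)
        x = x.reshape(B, T * x.shape[1], -1)                 # (B, N, D)
        x = self.aggregator(x)                               # spatio-temporal fusion
        tokens = self.adapter(self.projector(x))             # (B, N, llm_dim)
        # These tokens would be prepended to the LLM's text embeddings
        # (stage 3: end-to-end fine-tuning on Event-Chat).
        return tokens

Under these assumptions, an event window split into T time bins and rendered as 5-channel frames would yield a token sequence that the LLM attends to alongside the question text.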
Related Material

[pdf]
[arXiv]
[bibtex]
@InProceedings{Liu_2025_CVPR,
    author    = {Liu, Shaoyu and Li, Jianing and Zhao, Guanghui and Zhang, Yunjian and Meng, Xin and Yu, Fei Richard and Ji, Xiangyang and Li, Ming},
    title     = {EventGPT: Event Stream Understanding with Multimodal Large Language Models},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29139-29149}
}