VideoLLM-online: Online Video Large Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18407-18418

Abstract


Large Language Models (LLMs) have been enhanced with vision capabilities enabling them to comprehend images videos and interleaved vision-language content. However the learning methods of these large multimodal models (LMMs) typically treat videos as predetermined clips rendering them less effective and efficient at handling streaming video inputs. In this paper we propose a novel Learning-In-Video-Stream (LIVE) framework which enables temporally aligned long-context and real-time dialogue within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format and (3) an optimized inference pipeline to speed up interactive chat in real-world video streams. With our LIVE framework we develop a simplified model called VideoLLM-online and demonstrate its significant advantages in processing streaming videos. For instance our VideoLLM-online-7B model can operate at over 10 FPS on an A100 GPU for a 5-minute video clip from Ego4D narration. Moreover VideoLLM-online also showcases state-of-the-art performance on public offline video benchmarks such as recognition captioning and forecasting. The code model data and demo have been made available at showlab.github.io/videollm-online.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Chen_2024_CVPR, author = {Chen, Joya and Lv, Zhaoyang and Wu, Shiwei and Lin, Kevin Qinghong and Song, Chenan and Gao, Difei and Liu, Jia-Wei and Gao, Ziteng and Mao, Dongxing and Shou, Mike Zheng}, title = {VideoLLM-online: Online Video Large Language Model for Streaming Video}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {18407-18418} }