Streaming VideoLLMs for Real-Time Procedural Video Understanding

Dibyadip Chatterjee, Edoardo Remelli, Yale Song, Bugra Tekin, Abhay Mittal, Bharat Bhatnagar, Necati Cihan Camgoz, Shreyas Hampali, Eric Sauser, Shugao Ma, Angela Yao, Fadime Sener; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 22586-22598

Abstract


We introduce ProVideLLM, an end-to-end framework for real-time procedural video understanding. ProVideLLM integrates a multimodal cache configured to store two types of tokens -- verbalized text tokens, which provide compressed textual summaries of long-term observations, and visual tokens, encoded with DETR-QFormer to capture fine-grained details from short-term observations. This design reduces token count by 22x over existing methods when representing one hour of long-term observations while effectively encoding the fine-grained detail of the present. By interleaving these tokens in our multimodal cache, ProVideLLM achieves sub-linear scaling of memory and compute with video length, ensuring per-frame streaming inference at 10 FPS and streaming dialogue at 25 FPS, with a minimal 2GB GPU memory footprint. ProVideLLM also sets new state-of-the-art results on six procedural tasks across four datasets.

Related Material


@InProceedings{Chatterjee_2025_ICCV,
    author    = {Chatterjee, Dibyadip and Remelli, Edoardo and Song, Yale and Tekin, Bugra and Mittal, Abhay and Bhatnagar, Bharat and Camgoz, Necati Cihan and Hampali, Shreyas and Sauser, Eric and Ma, Shugao and Yao, Angela and Sener, Fadime},
    title     = {Streaming VideoLLMs for Real-Time Procedural Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {22586-22598}
}