REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding

Sakib Reza, Xiyun Song, Heather Yu, Zongfang Lin, Mohsen Moghaddam, Octavia Camps; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 2592-2603

Abstract


Integrating vision models into large language models (LLMs) has sparked significant interest in creating vision-language foundation models, especially for video understanding. Recent methods often utilize memory banks to handle untrimmed videos for video-level understanding. However, they typically compress visual memory using similarity-based greedy approaches, which can overlook the contextual importance of individual tokens. To address this, we introduce an efficient LLM adapter designed for video-level understanding of untrimmed videos that prioritizes the contextual relevance of spatio-temporal tokens. Our framework leverages scorer networks to selectively compress the visual memory bank and filter spatial tokens based on relevance, using a differentiable Top-K operator for end-to-end training. Across three key video-level understanding tasks (untrimmed video classification, video question answering, and video captioning), our method achieves competitive or superior results on four large-scale datasets while reducing computational overhead by up to 34%. Code is available at: https://github.com/fw-ic/REEF-VideoLLM/
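To make the selection mechanism described in the abstract concrete, the sketch below shows one common way to realize a relevance scorer with a differentiable Top-K: a small MLP scores each token, a hard Top-K mask is used in the forward pass, and a straight-through relaxation passes gradients to the scorer. This is a hypothetical PyTorch illustration written from the abstract alone; the module name RelevanceTopK, the MLP scorer, and the straight-through trick are assumptions for illustration, not the paper's actual operator or implementation.

import torch
import torch.nn as nn

class RelevanceTopK(nn.Module):
    """Score tokens with a small MLP and keep the k most relevant ones.

    A straight-through estimator makes the hard Top-K selection differentiable,
    so the scorer can be trained end-to-end. Hypothetical sketch; the paper's
    differentiable Top-K operator may differ.
    """

    def __init__(self, dim: int, k: int):
        super().__init__()
        self.k = k
        self.scorer = nn.Sequential(
            nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.scorer(tokens).squeeze(-1)          # (batch, num_tokens)
        soft_mask = torch.softmax(scores, dim=-1)         # differentiable relaxation
        topk_idx = scores.topk(self.k, dim=-1).indices    # hard selection (forward pass)
        hard_mask = torch.zeros_like(soft_mask).scatter_(-1, topk_idx, 1.0)
        # Straight-through: hard mask in the forward pass, soft gradients in the backward pass.
        mask = hard_mask + soft_mask - soft_mask.detach()
        kept = tokens * mask.unsqueeze(-1)                # weight tokens by the (near-binary) mask
        # Gather only the selected tokens to shrink the token set.
        batch_idx = torch.arange(tokens.size(0), device=tokens.device).unsqueeze(-1)
        return kept[batch_idx, topk_idx]                  # (batch, k, dim)


if __name__ == "__main__":
    x = torch.randn(2, 64, 256)            # 64 spatio-temporal tokens of width 256
    selector = RelevanceTopK(dim=256, k=16)
    compressed = selector(x)
    print(compressed.shape)                # torch.Size([2, 16, 256])

In a pipeline like the one the abstract describes, such a selector would be applied both to compress the visual memory bank and to filter spatial tokens before they are passed to the LLM; the straight-through mask is only one of several differentiable Top-K constructions that could serve that role.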

Related Material


@InProceedings{Reza_2025_CVPR,
    author    = {Reza, Sakib and Song, Xiyun and Yu, Heather and Lin, Zongfang and Moghaddam, Mohsen and Camps, Octavia},
    title     = {REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {2592-2603}
}