@InProceedings{Xu_2026_CVPR,
    author    = {Xu, Chenwei and Ye, Zhen and Wu, Shang and Li, Weijian and Wang, Zihan and Xia, Zhuofan and Lu, Lie and Maneriker, Pranav and Du, Fan and Li, Manling and Liu, Han},
    title     = {Towards Sparse Video Understanding and Reasoning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {11357-11368}
}
Towards Sparse Video Understanding and Reasoning
Abstract
We present **ReViSe** (_**Re**asoning with **Vi**deo **S**parsity_), a framework that combines multi-round reasoning with adaptive frame selection for video question answering (VQA). Existing vision-language models (VLMs) sample video frames uniformly, which introduces redundant or irrelevant frames. In contrast, ReViSe interactively selects informative frames through multi-round reasoning. To achieve this, ReViSe includes three modules: a multi-round conversation module that retains frame selection history as memory; a reasoning tracer that maintains a chain-of-thought across rounds; and a self-correction mechanism that enforces structural and behavioral validity. ReViSe integrates with both proprietary and open-source VLMs: it supports proprietary models in a "plug-and-play" manner and enables reinforcement fine-tuning for open-source models. Experiments on multiple VQA benchmarks show that **ReViSe** improves the video understanding of VLMs, raising accuracy while reducing the number of frames used.
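The multi-round selection loop described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`State`, `is_valid`, `answer_sparsely`, `toy_model`, `VIDEO`) is hypothetical, the "model" is a toy stand-in for a VLM, and the validity rules are only one plausible reading of "structural and behavioral validity". Reinforcement fine-tuning is not covered.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class State:
    """Hypothetical per-question state carried across rounds."""
    seen: List[int] = field(default_factory=list)   # frame-selection history (memory)
    trace: List[str] = field(default_factory=list)  # reasoning kept across rounds


# A reply is (thought, answer or None, requested frame indices).
Reply = Tuple[str, Optional[str], List[int]]
Model = Callable[[str, State], Reply]


def is_valid(reply: Reply, num_frames: int, seen: List[int]) -> bool:
    """Assumed validity rules: in-range indices, and at least one new frame."""
    _, answer, request = reply
    if answer is not None:
        return True
    in_range = bool(request) and all(0 <= i < num_frames for i in request)
    asks_new = any(i not in seen for i in request)
    return in_range and asks_new


def answer_sparsely(question: str, num_frames: int, model: Model,
                    first_frames: List[int], max_rounds: int = 4,
                    max_retries: int = 2) -> Tuple[Optional[str], State]:
    state = State(seen=list(first_frames))
    for _ in range(max_rounds):
        reply = model(question, state)
        retries = 0
        # Self-correction: re-query when the reply breaks the rules.
        while not is_valid(reply, num_frames, state.seen) and retries < max_retries:
            state.trace.append("invalid reply, retrying")
            reply = model(question, state)
            retries += 1
        if not is_valid(reply, num_frames, state.seen):
            break
        thought, answer, request = reply
        state.trace.append(thought)
        if answer is not None:
            return answer, state
        for i in request:
            if i not in state.seen:
                state.seen.append(i)
    return None, state


# Toy stand-in for a video and a VLM: frame 6 is the only informative frame.
VIDEO = ["empty"] * 10
VIDEO[6] = "cat"


def toy_model(question: str, state: State) -> Reply:
    for i in state.seen:
        if VIDEO[i] != "empty":
            return (f"frame {i} shows {VIDEO[i]}", VIDEO[i], [])
    nxt = max(state.seen) + 3
    return (f"nothing yet, look at frame {nxt}", None, [nxt])
```

Calling `answer_sparsely("What animal appears?", len(VIDEO), toy_model, first_frames=[0])` lets the toy model request frames round by round and stop once one of the frames it has seen is informative, so only a subset of the ten frames is ever inspected.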