Object-centric Video Question Answering with Visual Grounding and Referring
Abstract
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, which restricts their flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed **RGA3**, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose **STOM** (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present **VideoInfer**, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks covering video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models on both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.
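The abstract describes STOM only at a high level: visual prompts supplied by the user at a particular timestamp are made consumable by the model. As a rough illustration of what "overlaying" a visual prompt onto a video could look like, the sketch below blends a user-drawn box onto the frame nearest the queried timestamp before the clip reaches a visual encoder. This is not the paper's STOM implementation; the function name `overlay_visual_prompt`, the box-shaped prompt, the frame-selection rule, and the alpha-blending are all assumptions made for illustration.

```python
# Illustrative sketch only -- not the paper's STOM implementation.
# Hypothetical setup: a "visual prompt" is a user-drawn box at a given timestamp,
# and "overlay" means alpha-blending that box onto the matching frame.
import numpy as np

def overlay_visual_prompt(video, timestamp_s, fps, box, color=(255, 0, 0), alpha=0.5):
    """Blend a rectangular visual prompt onto the frame nearest to `timestamp_s`.

    video: array of shape (T, H, W, 3)
    box:   (x1, y1, x2, y2) in pixel coordinates
    """
    # Map the timestamp to a frame index, clamped to the clip length.
    t = min(int(round(timestamp_s * fps)), video.shape[0] - 1)
    frame = video[t].astype(np.float32)
    x1, y1, x2, y2 = box
    # Alpha-blend a solid color patch over the prompted region.
    patch = np.array(color, dtype=np.float32)
    frame[y1:y2, x1:x2] = (1 - alpha) * frame[y1:y2, x1:x2] + alpha * patch
    out = video.copy()
    out[t] = frame.astype(video.dtype)
    return out  # overlaid clip, ready to be fed to a visual encoder

# Example: mark an object at 2.4 s in a 30-fps clip with a half-transparent red box.
clip = np.zeros((90, 240, 320, 3), dtype=np.uint8)
clip = overlay_visual_prompt(clip, timestamp_s=2.4, fps=30, box=(50, 40, 120, 100))
```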
Related Material
[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Wang_2025_ICCV,
  author    = {Wang, Haochen and Chen, Qirui and Yan, Cilin and Cai, Jiayin and Jiang, Xiaolong and Hu, Yao and Xie, Weidi and Gavves, Stratis},
  title     = {Object-centric Video Question Answering with Visual Grounding and Referring},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {22274-22284}
}