Object-centric Video Question Answering with Visual Grounding and Referring
Abstract
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding. However, existing models primarily focus on high-level comprehension and are limited to text-only responses, which restricts their flexibility for object-centric, multi-round interactions. In this paper, we make three contributions: (i) we address these limitations by introducing a VideoLLM, termed **RGA3**, capable of performing both object referring and grounding for video reasoning tasks in a multi-round conversational manner, i.e., allowing users to iteratively interact with videos using both textual and visual queries; (ii) we propose **STOM** (Spatial-Temporal Overlay Module), a novel approach that allows arbitrary visual prompts to be processed at any timestamp within a video; (iii) we present **VideoInfer**, a manually curated object-centric video instruction dataset featuring question-answering pairs that require reasoning. We conduct comprehensive experiments on VideoInfer and other existing benchmarks covering video question answering and referring video object segmentation. The results on 12 benchmarks spanning 6 tasks show that RGA3 consistently outperforms baseline models on both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding. The code, dataset, and web demo will be publicly released.
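The abstract describes STOM only at a high level: visual prompts supplied by the user at a particular timestamp are made consumable by the model. As a rough illustration of what "overlaying" a visual prompt onto a video could look like, the sketch below blends a user-drawn box onto the frame nearest the queried timestamp before the clip reaches a visual encoder. This is not the paper's STOM implementation; the function name `overlay_visual_prompt`, the box-shaped prompt, the frame-selection rule, and the alpha-blending are all assumptions made for illustration.

```python
# Illustrative sketch only -- not the paper's STOM implementation.
# Hypothetical setup: a "visual prompt" is a user-drawn box at a given timestamp,
# and "overlay" means alpha-blending that box onto the matching frame.
import numpy as np

def overlay_visual_prompt(video, timestamp_s, fps, box, color=(255, 0, 0), alpha=0.5):
    """Blend a rectangular visual prompt onto the frame nearest to `timestamp_s`.

    video: array of shape (T, H, W, 3)
    box:   (x1, y1, x2, y2) in pixel coordinates
    """
    # Map the timestamp to a frame index, clamped to the clip length.
    t = min(int(round(timestamp_s * fps)), video.shape[0] - 1)
    frame = video[t].astype(np.float32)
    x1, y1, x2, y2 = box
    # Alpha-blend a solid color patch over the prompted region.
    patch = np.array(color, dtype=np.float32)
    frame[y1:y2, x1:x2] = (1 - alpha) * frame[y1:y2, x1:x2] + alpha * patch
    out = video.copy()
    out[t] = frame.astype(video.dtype)
    return out  # overlaid clip, ready to be fed to a visual encoder

# Example: mark an object at 2.4 s in a 30-fps clip with a half-transparent red box.
clip = np.zeros((90, 240, 320, 3), dtype=np.uint8)
clip = overlay_visual_prompt(clip, timestamp_s=2.4, fps=30, box=(50, 40, 120, 100))
```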
Related Material
[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Wang_2025_ICCV,
  author    = {Wang, Haochen and Chen, Qirui and Yan, Cilin and Cai, Jiayin and Jiang, Xiaolong and Hu, Yao and Xie, Weidi and Gavves, Stratis},
  title     = {Object-centric Video Question Answering with Visual Grounding and Referring},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {22274-22284}
}