VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos
Abstract
The rapid expansion of video data across domains has heightened the demand for efficient retrieval and question-answering systems, particularly for long-form videos. Existing Video Question Answering (VQA) approaches struggle to process extended video sequences due to high computational costs, loss of contextual coherence, and the difficulty of retrieving relevant information. To address these limitations, we introduce VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos, a novel framework that brings a retrieval-augmented generation (RAG) architecture to the video domain. VRAG first retrieves the most relevant video segments and then applies chunking and refinement to identify key sub-segments, enabling precise and focused answer generation. This design maximizes the effectiveness of the Multimodal Large Language Model (MLLM) by ensuring that only the most relevant content is processed. Our experimental evaluation on a long-form VQA benchmark demonstrates significant improvements in retrieval precision and answer quality. These results highlight the effectiveness of retrieval-augmented reasoning for scalable and accurate VQA on long-form video datasets.
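The abstract describes a three-stage pipeline: coarse retrieval of relevant segments, chunking and refinement into key sub-segments, and answer generation over only the retained evidence. The sketch below illustrates that flow under heavy assumptions: it is not the paper's implementation, and the hash-based `embed` function, the caption-based stand-in for video features, and the placeholder MLLM call are all hypothetical.

```python
"""Minimal sketch of a VRAG-style retrieve -> chunk -> refine -> answer
pipeline. All names, data structures, and the caption-based stand-in for
video features are illustrative assumptions, not the paper's method."""
import hashlib
import numpy as np

DIM = 64

def embed(text: str) -> np.ndarray:
    """Deterministic toy embedding; stands in for a real video-text encoder."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=DIM)
    return v / np.linalg.norm(v)

def segment_vector(frames: list[str]) -> np.ndarray:
    """Represent a segment by the normalized mean of its frame embeddings."""
    m = np.mean([embed(f) for f in frames], axis=0)
    return m / np.linalg.norm(m)

def retrieve(query: str, segments: dict[str, list[str]], k: int = 2):
    """Stage 1: coarse retrieval of the k most query-relevant segments."""
    q = embed(query)
    ranked = sorted(segments.items(),
                    key=lambda kv: -float(q @ segment_vector(kv[1])))
    return ranked[:k]

def chunk_and_refine(query: str, segments, window: int = 2, top: int = 3):
    """Stage 2: split retrieved segments into sub-segments (windows of
    frames) and keep only the sub-segments most relevant to the query."""
    q = embed(query)
    subs = []
    for seg_id, frames in segments:
        for i in range(0, len(frames), window):
            sub = frames[i:i + window]
            subs.append((float(q @ segment_vector(sub)), seg_id, sub))
    subs.sort(key=lambda t: -t[0])
    return subs[:top]

def answer(query: str, sub_segments) -> str:
    """Stage 3: hand only the refined evidence to an MLLM. The call here
    is a placeholder; a real system would pass frames, not captions."""
    context = "; ".join(" ".join(sub) for _, _, sub in sub_segments)
    return f"[MLLM answer to {query!r} given context: {context}]"

if __name__ == "__main__":
    video = {  # toy long-form video: segments as lists of frame captions
        "seg0": ["a chef chops onions", "a pan heats on the stove"],
        "seg1": ["a dog runs in a park", "the dog catches a frisbee"],
        "seg2": ["the chef plates the pasta", "guests sit at a table"],
    }
    q = "What does the chef do after cooking?"
    hits = retrieve(q, video, k=2)
    refined = chunk_and_refine(q, hits, window=1, top=2)
    print(answer(q, refined))
```

The point of the retrieve-then-refine split is that the expensive MLLM sees only a handful of query-relevant sub-segments rather than the full video, which is what keeps long-form inputs tractable.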
Related Material
[pdf] [bibtex]
@InProceedings{Gia_2025_CVPR,
    author    = {Gia, Bao Tran and Le, Khiem and Do, Tien and Mai, Tien-Dung and Ngo, Thanh Duc and Le, Duy-Dinh and Satoh, Shin'ichi},
    title     = {VRAG: Retrieval-Augmented Video Question Answering for Long-Form Videos},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {3698-3707}
}