Coarse to Fine Frame Selection for Online Open-Ended Video Question Answering

Vidyaranya Nuthalapati, Anirudh Tunga; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 353-361

Abstract


The central aim of Video Question Answering (VideoQA) is to answer questions posed in natural language based on the content of a given video. However, when applied to video streams such as CCTV recordings and live broadcasts, the solver faces more intricate challenges: in such scenarios, the segment of the video needed to answer a specific question is often only a small part of the entire stream. To address these complexities, a recent problem setting called Online Open-ended Video Question Answering (O2VQA) has been introduced. In this paper, we propose an architecture based on multi-modal foundation transformers for the O2VQA task. The architecture comprises three modules. The first module performs a coarse selection of the video segment relevant to answering the question. The second module refines this coarse segment using a Temporal Concept Spotting mechanism, which captures temporal saliency and identifies the frames most critical for answering the question. Lastly, an end-to-end Video-Language Pre-training model produces the answer. To evaluate the proposed model, we conduct experiments on the publicly available ATBS dataset. The results show that our approach outperforms current state-of-the-art models.
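The coarse-to-fine selection described above can be sketched at a high level as follows. This is a hypothetical illustration, not the paper's actual model: the relevance scores, segment sizes, and top-k saliency selection are illustrative stand-ins for the learned coarse-selection and Temporal Concept Spotting modules.

```python
# Hypothetical sketch of coarse-to-fine frame selection. In the paper,
# scores would come from multi-modal transformer modules; here they are
# plain numbers so the control flow is easy to follow.

def coarse_select(segment_scores):
    """Coarse stage: pick the segment with the highest
    (question, segment) relevance score."""
    return max(range(len(segment_scores)), key=lambda i: segment_scores[i])

def fine_select(frame_scores, k):
    """Fine stage: within the chosen segment, keep the k frames with the
    highest temporal-saliency scores (a stand-in for Temporal Concept
    Spotting), returned in temporal order."""
    order = sorted(range(len(frame_scores)),
                   key=lambda i: frame_scores[i], reverse=True)
    return sorted(order[:k])

def select_frames(segment_scores, frame_scores_per_segment, k=2):
    """Run both stages; the selected frames would then be passed to a
    video-language model to produce the answer."""
    seg = coarse_select(segment_scores)
    frames = fine_select(frame_scores_per_segment[seg], k)
    return seg, frames

if __name__ == "__main__":
    seg_scores = [0.1, 0.7, 0.2]          # question-segment relevance
    frame_scores = [
        [0.2, 0.1],
        [0.3, 0.9, 0.8, 0.1],
        [0.4, 0.2],
    ]
    print(select_frames(seg_scores, frame_scores, k=2))  # (1, [1, 2])
```

The two-stage structure keeps the expensive per-frame scoring confined to one coarsely selected segment, which is the efficiency argument behind coarse-to-fine selection on long streams.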

Related Material


[bibtex]
@InProceedings{Nuthalapati_2023_ICCV,
    author    = {Nuthalapati, Vidyaranya and Tunga, Anirudh},
    title     = {Coarse to Fine Frame Selection for Online Open-Ended Video Question Answering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {353-361}
}