VQuAD: Video Question Answering Diagnostic Dataset

Vivek Gupta, Badri N. Patro, Hemant Parihar, Vinay P. Namboodiri; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2022, pp. 282-291

Abstract


In this paper, we investigate the task of video-based question answering. We provide a diagnostic dataset that can be used to evaluate the extent of the reasoning abilities of various methods for solving this task; previous datasets proposed for this task do not support this kind of diagnostic evaluation. Our dataset is large-scale (around 1.3 million questions across the train and test sets combined) and evaluates both the spatial and temporal properties of objects and the relationships between objects with respect to these properties. We evaluate a state-of-the-art language model (BERT) as a baseline to understand how well the task can be solved from correlations in language features alone. Other existing networks are then used to combine video features with language features for solving this task. Unfortunately, we observe that the currently prevalent systems do not perform significantly better than the language baseline. We hypothesize that this is due to our efforts to ensure that the dataset is balanced and contains no obvious biases. To make progress, learning techniques need to acquire an ability to reason, going beyond basic correlation of biases. This is an interesting and significant challenge posed by our work.

Related Material


[bibtex]
@InProceedings{Gupta_2022_WACV,
    author    = {Gupta, Vivek and Patro, Badri N. and Parihar, Hemant and Namboodiri, Vinay P.},
    title     = {VQuAD: Video Question Answering Diagnostic Dataset},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2022},
    pages     = {282-291}
}