On the Hidden Treasure of Dialog in Video Question Answering

Deniz Engin, Francois Schnitzler, Ngoc Q. K. Duong, Yannis Avrithis; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2064-2073

Abstract


High-level understanding of stories in videos such as movies and TV shows from raw data is extremely challenging. Modern video question answering (VideoQA) systems often use additional human-made sources such as plot synopses, scripts, video descriptions, or knowledge bases. In this work, we present a new approach to understanding the whole story without such external sources. The secret lies in the dialog: unlike any prior work, we treat dialog as a noisy source to be converted into a text description via dialog summarization, much as recent methods treat video. The input of each modality is encoded by transformers independently, and a simple fusion method combines all modalities, using soft temporal attention for localization over long inputs. Our model outperforms the state of the art on the KnowIT VQA dataset by a large margin, without using question-specific human annotation or human-made plot summaries. It even outperforms human evaluators who have never watched a whole episode before. Code is available at https://engindeniz.github.io/dialogsummary-videoqa
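The fusion step the abstract describes, attending softly over time within each independently encoded modality and then combining the results, might be sketched as follows. This is a minimal illustration with made-up dimensions and dot-product attention, not the authors' actual implementation:

```python
import numpy as np

def soft_temporal_attention(question, sequence):
    """Attend over the time steps of one modality's encoded sequence.

    question: (d,) query vector; sequence: (T, d) per-time-step features.
    Returns a (d,) vector: an attention-weighted sum over time, which
    softly localizes the relevant part of a long input.
    """
    scores = sequence @ question              # (T,) dot-product scores
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # softmax over time
    return weights @ sequence                 # (d,) localized summary

def fuse_modalities(question, modality_sequences):
    """Simple fusion: attend within each modality, then concatenate."""
    pooled = [soft_temporal_attention(question, s) for s in modality_sequences]
    return np.concatenate(pooled)             # (num_modalities * d,)

# Toy example: three modalities (e.g. video description, dialog summary,
# and raw dialog), each a sequence of hypothetical 8-dim transformer features.
rng = np.random.default_rng(0)
q = rng.standard_normal(8)
seqs = [rng.standard_normal((t, 8)) for t in (20, 15, 30)]
fused = fuse_modalities(q, seqs)
print(fused.shape)  # (24,)
```

The fused vector would then feed a classifier scoring candidate answers; in the paper the per-modality encoders are transformers, whereas here the sequences are random placeholders.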

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Engin_2021_ICCV,
    author    = {Engin, Deniz and Schnitzler, Francois and Duong, Ngoc Q. K. and Avrithis, Yannis},
    title     = {On the Hidden Treasure of Dialog in Video Question Answering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {2064-2073}
}