MoQA - A Multi-Modal Question Answering Architecture

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen; Proceedings of the European Conference on Computer Vision (ECCV) Workshops, 2018, pp. 0-0

Abstract


Multi-Modal Machine Comprehension (M3C) deals with extracting knowledge from multiple modalities such as figures, diagrams and text. In particular, Textbook Question Answering (TQA) focuses on questions based on school curricula, where the text and diagrams are extracted from textbooks. A subset of questions cannot be answered from the diagrams alone, but requires external knowledge from the surrounding text. In this work, we propose a novel deep model that is able to handle different knowledge modalities in the context of the question answering task. We compare three different information representations encountered in TQA: a visual representation learned from images, a graph representation of diagrams, and a language-based representation learned from the accompanying text. We evaluate our model on the TQA dataset, which contains text and diagrams from sixth-grade material. Even though our model obtains competitive results compared to the state of the art, we still observe a significant gap in performance compared to humans. We discuss the shortcomings of the model and show the reasons behind the large gap to human performance by exploring the distribution of the different classes of mistakes the model makes.
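
The sketch below is not the authors' architecture; it is a minimal PyTorch illustration of the general idea the abstract describes: encoding a question together with three modality representations (image features, a diagram graph, and supporting text) and scoring a set of answer candidates. All module names, dimensions, and the simple concatenation-based fusion are assumptions made for illustration only.

```python
# Illustrative sketch only -- a generic multi-modal answer scorer, not MoQA itself.
import torch
import torch.nn as nn

class MultiModalQAScorer(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, img_dim=2048, node_dim=64, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.question_enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.text_enc = nn.GRU(emb_dim, hid, batch_first=True)   # surrounding textbook text
        self.answer_enc = nn.GRU(emb_dim, hid, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hid)                   # visual representation
        self.node_proj = nn.Linear(node_dim, hid)                 # diagram-graph node features
        self.fuse = nn.Sequential(nn.Linear(4 * hid, hid), nn.ReLU())
        self.score = nn.Linear(2 * hid, 1)

    def forward(self, question, text, image_feat, graph_nodes, answers):
        # question, text: (B, L) token ids; image_feat: (B, img_dim)
        # graph_nodes: (B, N, node_dim); answers: (B, K, L) token ids
        _, q = self.question_enc(self.embed(question))             # (1, B, hid)
        _, t = self.text_enc(self.embed(text))                     # (1, B, hid)
        v = self.img_proj(image_feat)                              # (B, hid)
        g = self.node_proj(graph_nodes).mean(dim=1)                # mean-pool graph nodes
        ctx = self.fuse(torch.cat([q.squeeze(0), t.squeeze(0), v, g], dim=-1))

        B, K, L = answers.shape
        _, a = self.answer_enc(self.embed(answers.view(B * K, L)))
        a = a.squeeze(0).view(B, K, -1)                            # (B, K, hid)
        pair = torch.cat([ctx.unsqueeze(1).expand(-1, K, -1), a], dim=-1)
        return self.score(pair).squeeze(-1)                        # (B, K) answer logits
```

In such a setup, the modality-specific encoders can be swapped independently (e.g. a graph network over the diagram instead of mean pooling), which mirrors the abstract's comparison of visual, graph, and language-based representations under a shared question answering head.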

Related Material


[pdf]
[bibtex]
@InProceedings{Haurilet_2018_ECCV_Workshops,
author = {Haurilet, Monica and Al-Halah, Ziad and Stiefelhagen, Rainer},
title = {MoQA - A Multi-Modal Question Answering Architecture},
booktitle = {Proceedings of the European Conference on Computer Vision (ECCV) Workshops},
month = {September},
year = {2018}
}