Two-stage Multimodality Fusion for High-performance Text-based Visual Question Answering
Text-based visual question answering (TextVQA) aims to answer a text-related question by reading the texts in a given image, which requires jointly reasoning over three modalities: the question, visual objects, and scene texts in the image. Most existing works leverage graphs or sophisticated attention mechanisms to enhance the interaction between scene texts and visual objects. In this paper, we observe that, compared with visual objects, the question and scene-text modalities are more important in TextVQA, while both the layouts and the visual appearances of scene texts are also useful. Based on these observations, we propose a two-stage multimodality fusion method for high-performance TextVQA, which first combines the question and OCR tokens semantically to better understand the texts, and then integrates the combined results into the visual features as additional information. Furthermore, to alleviate the redundancy and noise in the recognized scene texts, we develop a denoising module with a contrastive loss that makes our model focus on the relevant texts and thus obtain more robust features. Experiments on the TextVQA and ST-VQA datasets show that our method achieves competitive performance without the large-scale pre-training used in recent works, and outperforms state-of-the-art methods after pre-training.
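The two stages described above, and the contrastive denoising objective, can be sketched at a high level as follows. This is an illustrative toy only: the attention form (scaled dot-product), the feature shapes, the additive residual in stage 2, and the InfoNCE-style loss are all assumptions for exposition, not the paper's actual equations or hyperparameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    """Scaled dot-product attention: each query row aggregates the values."""
    scores = queries @ keys.T / np.sqrt(keys.shape[-1])
    return softmax(scores, axis=-1) @ values

rng = np.random.default_rng(0)
d = 8

# Hypothetical inputs: 3 question tokens, 5 OCR (scene-text) tokens, 4 objects.
question = rng.normal(size=(3, d))
ocr      = rng.normal(size=(5, d))
objects  = rng.normal(size=(4, d))

# Stage 1: question tokens attend over OCR tokens (semantic text fusion).
fused_text = attend(question, ocr, ocr)                               # (3, d)

# Stage 2: visual objects attend over the fused text features, injecting
# the combined text information as an additive residual.
enriched_objects = objects + attend(objects, fused_text, fused_text)  # (4, d)

def contrastive_denoise_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: pull the anchor toward the relevant (positive)
    OCR feature and push it away from the noisy (negative) ones."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    sims = np.array([cos(anchor, positive)] + [cos(anchor, n) for n in negatives])
    return -np.log(softmax(sims / tau)[0])

# Example: pooled question as anchor; OCR token 0 assumed question-relevant.
anchor = question.mean(axis=0)
loss = contrastive_denoise_loss(anchor, ocr[0], [ocr[i] for i in range(1, 5)])
```

In this sketch the denoising term simply encourages the question representation to be closer to relevant scene-text features than to distractors; the actual module in the paper may define relevance and the loss differently.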