Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering

Peng Gao, Zhengkai Jiang, Haoxuan You, Pan Lu, Steven C. H. Hoi, Xiaogang Wang, Hongsheng Li; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 6639-6648

Abstract


Learning effective fusion of multi-modality features is at the heart of visual question answering. We propose a novel method of dynamically fuse multi-modal features with intra- and inter-modality information flow, which alternatively pass dynamic information between and across the visual and language modalities. It can robustly capture the high-level interactions between language and vision domains, thus significantly improves the performance of visual question answering. We also show that, the proposed dynamic intra modality attention flow conditioned on the other modality can dynamically modulate the intra-modality attention of the current modality, which is vital for multimodality feature fusion. Experimental evaluations on the VQA 2.0 dataset show that the proposed method achieves the state-of-the-art VQA performance. Extensive ablation studies are carried out for the comprehensive analysis of the proposed method.

Related Material


[pdf]
[bibtex]
@InProceedings{Gao_2019_CVPR,
author = {Gao, Peng and Jiang, Zhengkai and You, Haoxuan and Lu, Pan and Hoi, Steven C. H. and Wang, Xiaogang and Li, Hongsheng},
title = {Dynamic Fusion With Intra- and Inter-Modality Attention Flow for Visual Question Answering},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}