Variational Causal Inference Network for Explanatory Visual Question Answering

Dizhan Xue, Shengsheng Qian, Changsheng Xu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2515-2525

Abstract


Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning process. Unlike traditional Visual Question Answering (VQA), which focuses solely on answering, EVQA aims to provide user-friendly explanations that enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, ignoring the causal correlation between them; they also neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual and question features. Second, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference method that establishes the target causal structure and predicts the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods.
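The three-stage pipeline sketched in the abstract can be illustrated with a small PyTorch example. Everything below is a hypothetical sketch, not the authors' released code: the module names (GatedExplanationDecoder, VariationalAnswerHead), the dimensions, the sigmoid-gated fusion, and the reparameterized latent are placeholder assumptions standing in for the multimodal explanation gating transformer and the variational causal inference described above.

# Illustrative PyTorch sketch of the three stages described in the abstract.
# All names, dimensions, and fusion details are hypothetical placeholders,
# not the authors' implementation.
import torch
import torch.nn as nn


class GatedExplanationDecoder(nn.Module):
    """Stand-in for the multimodal explanation gating transformer: a single
    decoder layer whose cross-modal attention output is blended with the
    token stream through a learned sigmoid gate."""

    def __init__(self, dim=256, heads=4, vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Linear(2 * dim, dim)  # gate over [token; cross-modal]
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens, vis_feats, q_feats):
        x = self.embed(tokens)
        x = x + self.self_attn(x, x, x)[0]
        memory = torch.cat([vis_feats, q_feats], dim=1)   # cross-modal memory
        attn = self.cross_attn(x, memory, memory)[0]
        g = torch.sigmoid(self.gate(torch.cat([x, attn], dim=-1)))
        x = g * attn + (1 - g) * x                        # gated fusion
        return self.out(x), x        # explanation logits, explanation states


class VariationalAnswerHead(nn.Module):
    """Stand-in for the variational causal inference step: the answer is
    classified from a reparameterized latent conditioned on both question
    and explanation representations, so the predicted answer is causally
    downstream of the generated explanation."""

    def __init__(self, dim=256, n_answers=100):
        super().__init__()
        self.to_mu = nn.Linear(2 * dim, dim)
        self.to_logvar = nn.Linear(2 * dim, dim)
        self.classify = nn.Linear(dim, n_answers)

    def forward(self, q_vec, expl_vec):
        h = torch.cat([q_vec, expl_vec], dim=-1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return self.classify(z), kl


# Toy forward pass: in practice the region and question features would come
# from a vision-and-language pretrained encoder.
vis = torch.randn(2, 36, 256)                  # 36 visual region features
q = torch.randn(2, 20, 256)                    # question token features
expl_tokens = torch.randint(0, 1000, (2, 12))  # partial explanation sequence
decoder, head = GatedExplanationDecoder(), VariationalAnswerHead()
expl_logits, expl_states = decoder(expl_tokens, vis, q)
answer_logits, kl = head(q.mean(dim=1), expl_states.mean(dim=1))

Feeding the decoder's explanation states into the answer head is the point of this design: the answer distribution depends on the generated explanation rather than being predicted in a separate, disconnected branch.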

Related Material


BibTeX:
@InProceedings{Xue_2023_ICCV,
    author    = {Xue, Dizhan and Qian, Shengsheng and Xu, Changsheng},
    title     = {Variational Causal Inference Network for Explanatory Visual Question Answering},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2515-2525}
}