Multimodal Integration of Human-Like Attention in Visual Question Answering

Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, Andreas Bulling; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 2648-2658

Abstract


Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to unimodal integration, even for inherently multimodal tasks such as visual question answering (VQA). We present the Multimodal Human-Like Attention Network (MULAN), the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into the neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN is competitive with the state of the art in its model class, achieving 73.98% accuracy on test-std and 73.72% on test-dev with approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like attention into neural attention mechanisms for VQA.
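
To make the core idea concrete, the following PyTorch sketch shows one plausible way predicted human attention could be injected into a transformer self-attention layer. This is an illustration under stated assumptions, not the paper's released implementation: the class name SaliencyBiasedSelfAttention, the bias_scale parameter, and the additive log-saliency bias scheme are all hypothetical choices made here for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SaliencyBiasedSelfAttention(nn.Module):
    """Single-head self-attention whose logits are biased by an external
    (human-like) saliency distribution over tokens or image regions.

    Hypothetical sketch: the additive-bias integration and all names here
    are assumptions, not the authors' released code.
    """

    def __init__(self, dim: int, init_bias_scale: float = 1.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5
        # Learnable weight on the human-attention bias term.
        self.bias_scale = nn.Parameter(torch.tensor(init_bias_scale))

    def forward(self, x: torch.Tensor, saliency: torch.Tensor) -> torch.Tensor:
        # x:        (batch, seq, dim)  token or region features
        # saliency: (batch, seq)       predicted human attention, sums to 1
        logits = (self.q(x) @ self.k(x).transpose(-2, -1)) * self.scale
        # Add log-saliency as a bias so tokens/regions that humans attend to
        # more strongly also receive proportionally more neural attention.
        bias = torch.log(saliency.clamp_min(1e-8)).unsqueeze(1)  # (batch, 1, seq)
        attn = F.softmax(logits + self.bias_scale * bias, dim=-1)
        return attn @ self.v(x)

# Example: bias question-token self-attention with text-saliency predictions.
# x = torch.randn(2, 14, 512)
# sal = torch.softmax(torch.randn(2, 14), dim=-1)
# out = SaliencyBiasedSelfAttention(512)(x, sal)   # (2, 14, 512)
```

In a MULAN-style setup, per the abstract, a text saliency model would supply the saliency input on the question-token stream and an image saliency model on the visual-region stream, so each modality's self-attention is guided by its own human-like attention predictions.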

Related Material


[bibtex]
@InProceedings{Sood_2023_CVPR,
    author    = {Sood, Ekta and K\"ogel, Fabian and M\"uller, Philipp and Thomas, Dominike and B\^ace, Mihai and Bulling, Andreas},
    title     = {Multimodal Integration of Human-Like Attention in Visual Question Answering},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2023},
    pages     = {2648-2658}
}