Visual Question Answering With Textual Representations for Images

Yusuke Hirota, Noa Garcia, Mayu Otani, Chenhui Chu, Yuta Nakashima, Ittetsu Taniguchi, Takao Onoye; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 3154-3157

Abstract


How far can we go with textual representations for understanding pictures? Deep visual features extracted by object recognition models are widely used across many tasks, and especially in visual question answering (VQA). However, conventional deep visual features may fail to convey all the details in an image the way humans do. Meanwhile, given the recent progress of language models, descriptive text may offer an alternative to deep visual features. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA.
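To make the idea concrete (this is an illustrative sketch, not the authors' pipeline), one can treat a descriptive caption as the only representation of the image and let an off-the-shelf language model answer questions against that text. The sketch below uses the Hugging Face transformers question-answering pipeline; the caption, the questions, and the choice of model are all assumptions made for illustration.

```python
# Hedged sketch: VQA over a textual image representation only.
# An off-the-shelf extractive QA model stands in for the paper's
# actual method; all inputs below are illustrative assumptions.
from transformers import pipeline

# A descriptive caption acts as the "image" (hypothetical example).
image_description = (
    "A brown dog is lying on a red couch next to a striped pillow, "
    "and a television is switched on in the background."
)

qa = pipeline(
    "question-answering",
    model="distilbert-base-cased-distilled-squad",  # assumed model choice
)

for question in ["What color is the couch?", "Where is the dog lying?"]:
    # The model extracts an answer span from the caption text alone;
    # no visual features are involved at any point.
    result = qa(question=question, context=image_description)
    print(f"{question} -> {result['answer']} (score={result['score']:.2f})")
```

Under this framing, VQA accuracy is bounded by how much of the image the text actually describes, which is precisely the trade-off the paper examines.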

Related Material


BibTeX
@InProceedings{Hirota_2021_ICCV,
    author    = {Hirota, Yusuke and Garcia, Noa and Otani, Mayu and Chu, Chenhui and Nakashima, Yuta and Taniguchi, Ittetsu and Onoye, Takao},
    title     = {Visual Question Answering With Textual Representations for Images},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2021},
    pages     = {3154-3157}
}