VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning

Chen, Kang; Wu, Xiangqian

Kang Chen, Xiangqian Wu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27218-27227

Abstract

Achieving the optimal form of Visual Question Answering mandates a profound grasp of understanding grounding and reasoning within the intersecting domains of vision and language. Traditional VQA benchmarks have predominantly focused on simplistic tasks such as counting visual attributes and object detection which do not necessitate intricate cross-modal information understanding and inference. Motivated by the need for a more comprehensive evaluation we introduce a novel dataset comprising 23781 questions derived from 10124 image-text pairs. Specifically the task of this dataset requires the model to align multimedia representations of the same entity to implement multi-hop reasoning between image and text and finally use natural language to answer the question. Furthermore we evaluate this VTQA dataset comparing the performance of both state-of-the-art VQA models and our proposed baseline model the Key Entity Cross-Media Reasoning Network (KECMRN). The VTQA task poses formidable challenges for traditional VQA models underscoring its intrinsic complexity. Conversely KECMRN exhibits a modest improvement signifying its potential in multimedia entity alignment and multi-step reasoning. Our analysis underscores the diversity difficulty and scale of the VTQA task compared to previous multimodal QA datasets. In conclusion we anticipate that this dataset will serve as a pivotal resource for advancing and evaluating models proficient in multimedia entity alignment multi-step reasoning and open-ended answer generation. Our dataset and code is available at https://visual-text-qa.github.io/.

Related Material

[pdf] [arXiv]

[bibtex]

@InProceedings{Chen_2024_CVPR, author = {Chen, Kang and Wu, Xiangqian}, title = {VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {27218-27227} }