ArtQuest: Countering Hidden Language Biases in ArtVQA

The task of Visual Question Answering (VQA) has been studied extensively on general-domain real-world images. Transferring insights from general-domain VQA to the art domain (ArtVQA) is non-trivial, as the latter requires models to identify abstract concepts, details of brushstrokes and styles of paintings in the visual data, as well as to possess background knowledge about art. This is exacerbated by the lack of high-quality datasets. In this work, we shed light on hidden linguistic biases in the AQUA dataset, which is the only publicly available benchmark dataset for ArtVQA. As a result of these biases, the majority of questions can be answered without consulting the visual information, making the “V” in ArtVQA rather insignificant. In order to counter this problem, we create a simple, yet practical dataset, ArtQuest, using structured information from the SemArt collection. Our dataset and the pipeline to reproduce our results are publicly available at https://github.com/bletib/artquest.


Introduction
The emergence of large foundation models has led to notable improvements in multimodal vision-language understanding tasks such as visual question answering (VQA; [8,20,32]). While these models have been extensively studied for general-domain tasks on generic real-world images, their capabilities in understanding specific domains such as art remain unclear. Art is a fundamental aspect of human culture, and art museums are visited by many millions of people every year. Thus, achieving visual question answering in the art domain (ArtVQA) is an important step towards conversational systems that can guide and assist people by addressing their information needs. Imagine encountering an interesting artwork and wondering who created it or in which time-frame it was created. ArtVQA can provide the answer, given a photo of the artwork and the relevant question.
Achieving ArtVQA is a challenging task, since the model needs to understand the detailed visual information in paintings, e.g., brushstrokes and common patterns in artistic styles, for inferring information about the artist, type or art movements from the painting. This visual information is also often represented at different levels of abstraction, making the visual understanding quite different from the understanding of general-domain images. Moreover, the model needs to interpret the natural language question and associate it with the visual data. ArtVQA also requires the model to possess background knowledge about the historical context of artworks, e.g., to answer questions such as "when was the painting created?" [12].
Our work employs a generative approach for ArtVQA using a prefix language modeling objective. We investigate AQUA, the only publicly available benchmark dataset for ArtVQA, and identify hidden language biases that exist in this data, casting doubt on its value for VQA evaluation. In particular, we show that, due to hidden biases, the majority of questions can be answered without any dependency on the visual information, making the "V" in VQA rather insignificant. These biases can falsely suggest that AI models are making progress in visual understanding of artworks. This observation motivated us to provide a cleaner, more reliable, and less biased dataset for the task of ArtVQA that genuinely requires consulting the visual data to answer knowledge-seeking questions. We propose ArtQuest (Art Questions), a new set of question-answer pairs for the paintings in the SemArt collection [11] using the structured information in this collection. We show that ArtQuest elevates the importance of visual data for answering questions and hence allows for a more reliable training and evaluation of ArtVQA models. While ArtQuest consists of simple types of questions, we believe it is the first step for enabling reliable benchmarking of ArtVQA models. To the best of our knowledge, this is the first work to study linguistic biases in ArtVQA as well as to evaluate the performance of state-of-the-art vision and language models in the art domain.

Related Work
Visual Question Answering. Several attempts aim at solving VQA as a classification task for predicting the unique answers seen in the training dataset [2,18]. Recent research shows rapid progress in VQA using Vision-Language Pre-training (VLP). VLP learns effective representations for both visual and textual data while capturing their correspondence. Existing VLP approaches use a large amount of image-text pairs and pre-training tasks such as contrastive learning of vision-language data [13,27,38], prefix language modeling [37], image-conditioned masked language modeling or text-conditioned masked image modeling [4,15,34], as well as image-conditioned causal language modeling [20,21,39]. Most of these VLP models employ a fully-connected layer on top of their VLP architecture to recast the VQA task as classification [4,10,32,36]. [1] implemented generative versions of ViLBERT [24] and ALBEF [22] and showed that generative approaches tend to result in better out-of-distribution generalization. Inspired by this, we choose a generative prefix language modeling approach for solving ArtVQA. Regarding language priors and biases in general-domain VQA, the authors of [14] published a new benchmark dataset with reduced language priors. In [29], adversarial regularization by training with a question-only adversary has been proposed. Furthermore, [29] aims at reducing the superficial correlations between questions and their corresponding frequent answers by adding the objective of distinguishing superficially similar instances in the training step.
Vision-Language and VQA for Art. In the art domain, VLP has enabled recent advances in artistic image generation from text prompts [5,30,31]. In CLIP-Art [7], the contrastive vision-language loss from CLIP [27] is used to fine-tune on the iMet collection [40], leading to improvements on downstream multimodal retrieval and classification tasks for paintings. [3] present a framework for generating informative painting captions based on masked sentence generation using LSTM and knowledge retrieval using TF-IDF vectors. The authors report their experimental results on the SemArt collection [11]. For ArtVQA, [12] made notable contributions by introducing the AQUA dataset and the VIKING baseline. The knowledge question-answer pairs in AQUA were generated using rule-based approaches similar to [16]. For visual questions, the authors employed two different approaches. One was to use iQAN [23] to generate questions along with Amazon Rekognition for object detection as well as answer generation. Another approach was to use Pythia [33] to generate captions for each painting and then apply rule-based approaches on the generated captions so as to obtain question-answer pairs. Our work provides detailed analyses that reveal linguistic biases in AQUA. Subsequently, we propose ArtQuest in order to avoid such biases.

Prefix Language Modeling for ArtVQA
In this work, VQA is formulated as a generative sequence-to-sequence modeling task with the objective of Prefix Language Modeling (PrefixLM) [37]. For a sequence $s$, the goal of PrefixLM is to auto-regressively predict $s_{\setminus t}$ conditioned on the prefix sequence $s_t$, where $s_t \oplus s_{\setminus t} = s$. The symbol $\oplus$ is used to denote concatenation throughout this work.
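To make the objective concrete, the following minimal sketch computes a PrefixLM loss for a single toy sequence: the tokens before the split position form the prefix, and only the remaining tokens contribute to the average negative log-likelihood. The function name `prefix_lm_loss`, the toy vocabulary, and the uniform stand-in model are illustrative assumptions and are not taken from our pipeline.

```python
import math

def prefix_lm_loss(sequence, t, next_token_probs):
    """Average negative log-likelihood of the suffix sequence[t:] given sequence[:t].

    `next_token_probs(context)` is a hypothetical stand-in for a trained model:
    it returns a dict mapping each candidate token to its probability.
    """
    if not 0 < t < len(sequence):
        raise ValueError("t must leave a non-empty prefix and a non-empty suffix")
    total = 0.0
    for pos in range(t, len(sequence)):
        context = sequence[:pos]  # prefix plus the suffix tokens generated so far
        probs = next_token_probs(context)
        total += -math.log(probs[sequence[pos]])
    return total / (len(sequence) - t)

# Toy vocabulary and model (assumptions for illustration only).
VOCAB = ["who", "painted", "this", "?", "rembrandt"]

def uniform_model(context):
    # Every token is equally likely, regardless of the context.
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

# The question acts as the prefix (first 4 tokens); the answer is the suffix.
loss = prefix_lm_loss(["who", "painted", "this", "?", "rembrandt"], t=4,
                      next_token_probs=uniform_model)
```

With a uniform model over five tokens, the loss equals log 5, which is the value a trained model would need to improve upon.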
Closed-book ArtVQA. In the closed-book VQA scenario, given a training dataset $T = \{(v_i, q_i, a_i)\}_{i=1}^{D}$ of size $D$, where $v_i$ is a painting, $q_i$ is an associated natural language question, and $a_i$ is the corresponding answer in natural language, our goal is to train a model that generates the correct answer $a_i$ given the image-question pair $(v_i, q_i)$. We assume encoding functions to obtain $f_v \in \mathbb{R}^n$ as an $n$-dimensional vector encoding for image $v_i$ and the text embedding $f_q \in \mathbb{R}^{m \times l}$ for the question $q_i$ with $l$ tokens. We apply a fully connected projection layer $F : \mathbb{R}^n \to \mathbb{R}^m$ on $f_v$ and obtain the projected embedding $f'_v = F(f_v) \in \mathbb{R}^m$, in order to unify the embedding sizes of the encoded image and question. Furthermore, a transformer-based encoder-decoder architecture is used to achieve PrefixLM, where the decoder receives the concatenation $f'_v \oplus f_q \in \mathbb{R}^{m \times (l+1)}$. The decoder is then trained to generate the encoded answer $f_a \in \mathbb{R}^{m \times l}$ for the answer $a_i$ with $l$ tokens. We generate each token in the answer sequence auto-regressively with cross-entropy as the loss function. In this approach, the concatenation of the encoded visual data and the cross-attention in the transformer decoder enables the VQA model to incorporate visual information for answer generation.
Open-book ArtVQA. We also consider ArtVQA in an open-book scenario, where the model is allowed to see an additional explanatory caption $c_i$ about the painting $v_i$ when providing the answer $a_i$ to the question $q_i$. The motivation for this is that in ArtVQA, answering a question might require external background information not explicitly present in the painting. Therefore, the model may use the additional information in the explanatory text to elicit the correct answer to the question.
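The shape bookkeeping of the closed-book input can be sketched as follows: an n-dimensional image vector is projected to m dimensions by a fully connected layer and prepended to the l question token embeddings, giving l + 1 vectors of dimension m. This is a minimal sketch with random toy values; the helper names `project` and `build_prefix` and the toy sizes are assumptions, and the sequence is stored as a list of token vectors rather than as an m × (l+1) matrix.

```python
import random

def project(vec, weight, bias):
    """Fully connected layer R^n -> R^m, computed as weight @ vec + bias."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

def build_prefix(image_vec, question_embs, weight, bias):
    """Prepend the projected image vector to the question token embeddings."""
    projected = project(image_vec, weight, bias)
    return [projected] + question_embs  # l + 1 vectors, each of dimension m

# Toy sizes (assumptions); real encoders use much larger dimensions.
n, m, l = 8, 4, 5
rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(n)] for _ in range(m)]
bias = [0.0] * m
image_vec = [rng.uniform(-1, 1) for _ in range(n)]
question_embs = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(l)]

prefix = build_prefix(image_vec, question_embs, weight, bias)
```

Placing the image vector first means every answer token can attend to it through the same mechanism that attends to the question tokens.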
We employ an image-text retrieval approach in order to fetch the most relevant caption $c_i$ for a query image $v_i$ from a database $C$ containing art-related captions. Using the appropriate encoding functions, each caption $c$ with $l$ tokens is encoded to obtain $f_c \in \mathbb{R}^{m \times l}$. We then employ average pooling to obtain a [CLS]-level embedding $\bar{f}_c \in \mathbb{R}^m$ as the representation of the caption sequence. The image $v_i$ is also encoded as $f_v \in \mathbb{R}^n$. Similar to the previous section, we apply a fully connected layer to unify the embedding dimensions, yielding $f'_v \in \mathbb{R}^m$. We then apply 1-Nearest Neighbor with cosine similarity to identify the closest caption in the database for the query image $v_i$: $c_i = \arg\max_{c \in C} \cos(f'_v, \bar{f}_c)$ (1). The retrieved caption $c_i$ is then concatenated to the question to form the final text $T_i$, following the template (Eq. 2) suggested by [28].
Using the text encoding function, the embedding $f_t \in \mathbb{R}^{m \times l}$ is obtained for the final text $T_i$. For answer generation by PrefixLM, the decoder receives $f'_v \oplus f_t \in \mathbb{R}^{m \times (l+1)}$, where $f'_v \in \mathbb{R}^m$ is the projected image embedding. The decoder is then optimized for auto-regressive generation of the encoded answer $f_a \in \mathbb{R}^{m \times l}$ for the answer $a_i$ with $l$ tokens using the cross-entropy loss. Since the cross-attention in the transformer decoder receives the concatenated encoded visual data, answer generation is grounded on the visual information in our approach.
A schematic overview of our approach is shown in Fig. 1. In the open-book scenario, the optional retrieval module is activated and retrieves the most relevant caption for the given image.
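The retrieval step described above can be sketched with a brute-force search: each caption is mean-pooled into a single vector, and the caption with the highest cosine similarity to the image embedding is returned. This is a minimal sketch on toy two-dimensional embeddings, assuming the image embedding has already been projected to the caption dimension; the function names are hypothetical, and an approximate nearest-neighbor library would replace the exhaustive loop at scale.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pool(token_embs):
    """Average token embeddings into a single caption-level vector."""
    dim = len(token_embs[0])
    return [sum(tok[d] for tok in token_embs) / len(token_embs)
            for d in range(dim)]

def retrieve_caption(image_emb, caption_token_embs):
    """Index of the caption whose pooled embedding is closest to the image."""
    pooled = [mean_pool(c) for c in caption_token_embs]
    scores = [cosine(image_emb, p) for p in pooled]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy data (assumptions): two captions with two token embeddings each.
captions = [
    [[1.0, 0.0], [0.8, 0.2]],
    [[0.0, 1.0], [0.1, 0.9]],
]
image_emb = [0.1, 1.0]  # assumed to be already projected to dimension m
best = retrieve_caption(image_emb, captions)
```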

Uncovering Language Biases in AQUA
Underlying language biases in VQA datasets can lead to the false impression that VQA models are making progress towards truly understanding images when they merely exploit language priors to achieve a high accuracy. Inspired by [14], we started our study on ArtVQA with the goal of understanding whether the "V" truly matters in ArtVQA. We used the only available benchmark dataset, AQUA, which encompasses two kinds of questions, visual and knowledge ones, each of which we evaluated for language biases.
Visual questions. The authors of [12] define visual questions as questions that mainly target visual contents in paintings, e.g., "what do a group of men stand next to?". For studying language biases in visual questions, we used the solution illustrated in Sec. 3, while ignoring the image input. We considered the closed-book VQA scenario and used BART-base [19] as well as Flan-T5-small [6] for the PrefixLM modeling. In this case, for example, when using BART-base, we used BART's encoder as our Language Encoder and BART's decoder as the Vision-Language Decoder. The same setup was repeated when using the T5 model.
We performed our experiments in zero-shot as well as fine-tuned settings. Our motivation for reporting the zero-shot results of BART and T5 is to show that these models do not have prior background knowledge with respect to the art domain. When fine-tuning, we used question-answer pairs and fine-tuned the encoder-decoder model end-to-end. The encoder received questions and the decoder generated answers auto-regressively. We set the maximum length for the encoder input to 512 and for the decoder output to 100 tokens. Our final evaluations are performed on the AQUA test split. Results of our experiments are given in Tab. 1.
As can be seen, the zero-shot results of BART and T5 are quite weak on ArtVQA, confirming the assumption that these models do not already possess much prior knowledge in the art domain. However, once these models are fine-tuned merely on the question-answer pairs from AQUA without the presence of images, they reach up to 71% accuracy on answering visual questions. Achieving 71% accuracy with no visual data strongly points to the presence of hidden language biases in the visual questions, making the "V" fairly insignificant for this dataset.
In order to further evaluate the effectiveness of our PrefixLM model, we repeated our experiment while considering the images. The results of this are also reported in Tab. 1. We use the CLIP visual encoder [27] with ViT-B/32 back-end for encoding paintings. Our results show that using the visual data results in an absolute improvement of about 8% in the accuracy of answering visual questions. This observation shows that our PrefixLM modeling is effective in incorporating the visual information for the task of VQA.
Knowledge questions. [12] describes knowledge questions as questions that require background knowledge in the art domain for answering. These questions have been created by applying rule-based approaches such as those by [16] on the descriptions from the SemArt dataset [11]. The fact that the visual data has not been considered in the process of creating question-answer pairs is our first clue regarding whether "V" really matters for answering these questions.
We performed closed-book as well as open-book VQA while ignoring the input images. We again employed our PrefixLM encoder-decoder approach from Sec. 3. For the closed-book scenario, we repeated the kinds of experiments described above for visual questions using BART-base and Flan-T5-small. The results are summarized in Tab. 2, where we report the accuracy scores. In this setting, the poor performance of both BART and T5 at answering knowledge questions in the closed-book scenario makes it apparent that achieving closed-book VQA for knowledge questions is a challenging task. Even after fine-tuning, these models achieve an accuracy of less than 13%.
In the open-book scenario, in order to assess the performance without the presence of visual data, we adapted our retrieval module in Fig. 1 to work without the image input. For this, we considered two approaches:
1. Question-based Caption Retrieval (QCR) using TF-IDF vectors and cosine similarity. We pick the top-10 captions and re-rank them by training a BERT-base classifier [9]. The classifier receives a question and one of the top-10 captions $c$ and learns a binary classification $F : (q_i, c) \to \{0, 1\}$. This approach is inspired by [12].
2. Oracle Method (OM) of fetching the corresponding caption for each image from the SemArt collection. Here, we use image names for finding the overlap between the AQUA and SemArt datasets.
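The first stage of question-based caption retrieval can be sketched as follows: captions and the question are mapped to TF-IDF vectors, and the captions are ranked by cosine similarity to the question. This is a minimal sketch using whitespace tokenization and a smoothed IDF, both of which are assumptions; the toy captions are invented for illustration, and the subsequent re-ranking with a trained classifier is not shown.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def build_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    df = Counter(tok for doc in docs for tok in set(tokenize(doc)))
    return {tok: math.log((1 + n_docs) / (1 + count)) + 1.0
            for tok, count in df.items()}

def tfidf(text, idf):
    """Sparse TF-IDF vector; tokens outside the corpus vocabulary are ignored."""
    counts = Counter(tok for tok in tokenize(text) if tok in idf)
    return {tok: c * idf[tok] for tok, c in counts.items()}

def cosine(u, v):
    dot = sum(w * v.get(tok, 0.0) for tok, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def top_k_captions(question, captions, k=10):
    """Indices of the k captions most similar to the question."""
    idf = build_idf(captions)
    q_vec = tfidf(question, idf)
    scored = [(cosine(q_vec, tfidf(c, idf)), i) for i, c in enumerate(captions)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, ties by index
    return [i for _, i in scored[:k]]

# Toy captions (invented for illustration only).
captions = [
    "a still life with flowers and fruit",
    "van gogh painted this landscape in the asylum of saint-paul",
    "a portrait of a young woman in a blue dress",
]
candidates = top_k_captions("when did van gogh live in the asylum", captions, k=2)
```

Note that no image is involved at any point, which is exactly why this retrieval variant is suitable for probing language bias.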
For the open-book scenario, we experimented with the Flan-T5-small model. We chose T5, since it has already been optimized for supporting open-book question answering.
The encoder received the concatenated question and caption using the template described in Eq. (2). The maximum lengths for the encoder input and decoder output are set to 512 and 100 tokens, respectively. Results of our experiments in Tab. 3 illustrate that fine-tuning Flan-T5 using the QCR and OM retrieval approaches enables answering knowledge questions with an accuracy of up to 77.6% and 85%, respectively. These high scores alone are not necessarily an indicator of language bias in knowledge questions, since it could be that the captions already provide informative details about what is present in the visual data. However, we observe that when adding the visual component by encoding images and using PrefixLM, the accuracy stays the same, showing that the visual data is not playing an essential role for comprehending and answering questions. This observation once again suggests that "V" plays a negligible role for VQA in the AQUA dataset.
In addition, we provide qualitative examples in Fig. 2 to illustrate how knowledge questions do not depend on the visual information and can be answered regardless of the image. Based on OK-VQA [25], we believe it is necessary for knowledge questions in a VQA task to include dependencies and references to the visual information. As an example, OK-VQA includes the question "what sort of vehicle uses this item?", which is asked about an image of a fire hydrant in a street. The ground truth answer to this question is "firetruck". In this example, due to the mention of "this item" in the question, the VQA model must gain insights about the objects in the image, connect the question and objects, and ultimately determine the correct answer.

ArtQuest: Art "Quest"ions
Question-answer generation. Given the shortcomings observed for the sole available benchmark dataset for ArtVQA, there is an urgent need to curate a more reliable, less biased ArtVQA benchmark. To this end, we harnessed the paintings and the structured information from the SemArt dataset [11]. We used the artist, title, technique, school, time-frame, and type attributes from SemArt and manually created six initial open-ended English language questions for each painting. These initial questions are denoted by: Q = {"Who is the artist of this image?", "What is the title of this painting?", "What painting technique is used?", "What is the school of the painting?", "In which time-frame was the painting painted?", "What is the type of this painting?"}. We consider these questions to represent knowledge questions whose answers depend on the visual content of the paintings. Despite these questions being fairly simple, we argue that achieving ArtVQA models that answer these questions correctly is the first step towards developing reliable ArtVQA systems. In future work, we plan to expand the diversity of the questions with new versions of ArtQuest.
In order to engender greater lexical variety, we employed ChatGPT [26] to rephrase our initial questions. We used the prompt: "Rephrase each of the following questions 5 times. Be as short and precise as possible!"
We invoked this prompt two times and manually selected a combination of the best generated questions from a human's perspective. As a result, we ended up having 5 differently phrased versions $Q_i$ for each initial question in $Q$. For each painting, among the 5 question versions in $Q_i$ for each type, we randomly selected one version to use with that painting. As a result, we have 6 questions per painting, which are expressed in somewhat different ways across the paintings. As shown in Fig. 4, our question creation approach ensures a balanced question type distribution in ArtQuest. Sharing question semantics across the paintings in ArtQuest makes the dataset less prone to linguistic biases. According to [14], one approach for reducing language priors is to ensure that, given a triplet $(I, Q, A)$ of image, question, and answer, there exists an image $I'$ such that the answer to $Q$ is $A' \neq A$. In ArtQuest, since the questions are semantically shared across all paintings, this condition is already satisfied. For example, a question like "Who is the artist of this image?" is asked about each painting and therefore is intrinsically less likely to always be answered with the same painter in the ArtQuest dataset. This is also supported by Fig. 4, which shows that ArtQuest includes works from a large number of artists. The same reasoning holds for other types of questions in ArtQuest.
The answer for each generated question is taken from the corresponding attribute value in the SemArt collection. For example, for the question "Who is the artist of this image?", we use the painting name to match the painting in the SemArt dataset and then retrieve the value of the "artist" column in SemArt as the ground truth answer for the question.
Finally, we randomly selected 100 paintings from ArtQuest and asked an art expert to manually review the correctness of the 600 created question-answer pairs. All of the generated question-answer pairs were annotated as correct. In Fig. 3, a qualitative example of our created question-answer pairs can be seen.
Dataset analysis. We followed the splits in the SemArt [11] dataset also for the generated question-answer pairs. Thus, there are more than 17K, 1K, and 1K unique paintings in the training, validation, and test sets, respectively. For each image, we created the 6 different question-answer pairs as described before.
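The question-answer generation described above can be sketched as a small routine that, for one painting record, picks one phrasing per question type at random and reads the answer from the matching attribute. This is a minimal sketch: the first phrasing in each list is an initial question from $Q$, whereas the second phrasing, the dictionary keys, and the example record are hypothetical placeholders (ArtQuest uses 5 phrasings per type, and the actual SemArt column names may differ).

```python
import random

# First entry: initial question from Q. Second entry: hypothetical rephrasing.
QUESTIONS = {
    "artist": ["Who is the artist of this image?", "Who painted this artwork?"],
    "title": ["What is the title of this painting?", "What is this painting called?"],
    "technique": ["What painting technique is used?", "Which technique was used?"],
    "school": ["What is the school of the painting?", "Which school is this from?"],
    "timeframe": ["In which time-frame was the painting painted?",
                  "When was the painting created?"],
    "type": ["What is the type of this painting?", "What type of artwork is this?"],
}

def make_qa_pairs(record, rng):
    """Create one question-answer pair per attribute for a single painting."""
    pairs = []
    for attribute, versions in QUESTIONS.items():
        pairs.append({
            "image": record["image"],
            "type": attribute,
            "question": rng.choice(versions),  # one random phrasing per type
            "answer": record[attribute],       # answer comes from the metadata
        })
    return pairs

# Hypothetical record; keys and values are placeholders, not real SemArt data.
record = {
    "image": "example.jpg",
    "artist": "Example Artist",
    "title": "Example Title",
    "technique": "oil on canvas",
    "school": "Italian",
    "timeframe": "1601-1650",
    "type": "religious",
}
pairs = make_qa_pairs(record, random.Random(0))
```

Because every painting receives every question type, the question text alone carries no information about which answer is correct.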
We present the answer distribution for each question type in Fig. 4. The majority of questions about school, technique, time-frame, and type are answered by Italian, oil on canvas, 1601-1650, and religious, respectively. The corresponding ZeroR baselines for Italian as school, oil on canvas as technique, 1601-1650 as time-frame and religious as type are 41.4%, 47.1%, 17.7% and 38.8%, respectively. As can be seen, the answer distribution still makes it hard to obtain a very high accuracy when merely relying on priors. Moreover, in the distributions for the artist and title questions, there is a very wide variety of answer values. The distribution of question lengths in ArtQuest is plotted in Fig. 5. We observe that the majority of questions in ArtQuest include 5-7 words. Finally, we provide the distribution of questions by their first four words in Fig. 6.
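A ZeroR baseline simply predicts the most frequent training answer for every test question of a given type. The following minimal sketch illustrates the computation on invented toy answers; the function name `zero_r` and the data are assumptions, and the resulting number is unrelated to the percentages reported above.

```python
from collections import Counter

def zero_r(train_answers, test_answers):
    """Predict the most frequent training answer for every test item."""
    majority, _ = Counter(train_answers).most_common(1)[0]
    correct = sum(1 for answer in test_answers if answer == majority)
    return majority, correct / len(test_answers)

# Toy answers for a single question type (invented for illustration).
train = ["Italian", "Italian", "Dutch", "French", "Italian"]
test = ["Italian", "Dutch", "Italian", "Flemish"]
majority, accuracy = zero_r(train, test)
```

Any model that cannot clearly exceed this accuracy on a question type has learned little beyond the answer prior.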
Testing for language biases. We repeated the experiments from Sec. 4, this time evaluating whether there are language biases in ArtQuest. We used the BART-base model as well as Flan-T5-small with and without the presence of images. The results are given in Tab. 4. We observe that zero-shot BART and T5 achieve very low accuracy scores, once again showing that these models do not carry sufficient prior art knowledge. Fine-tuning BART without the presence of images achieves about 20% accuracy. This increase is due to the imbalance of the answer distribution in ArtQuest. As shown in Fig. 4, e.g., the majority of paintings in the dataset are from the Italian school. Therefore, a question such as "What is the school of the painting?" may get biased towards always answering "Italian". However, the overall language bias is found to be small. Without the visual information, the model cannot achieve a very high accuracy on ArtQuest. The same explanation applies when fine-tuning closed-book and open-book T5 without images. In contrast, when testing with the presence of images, we observe substantial improvements of up to around 30% in the model's ability to correctly answer closed-book questions. This shows that the presence of "V" is significant for answering questions in ArtQuest. In the open-book scenario, we observe that answering the questions when only using the captions reaches up to 63% accuracy. This is because the SemArt captions contain background information about the painting and can include information such as title, artist, etc. We also test the open-book scenario with images and observe that using images for VQA improves the accuracy by up to 3%. We conclude that the visual data provides additional information for the VQA model.

Benchmarking VQA Models
In this section, we provide baselines for VQA when using ArtQuest. For zero-shot closed-book VQA, we considered OFA [35] and BLIP [21], which are amongst the top VQA models in the general-domain VQA leaderboard. Our experimental results in Tab. 5 show that these models achieve less than 5% accuracy and BLEU score on ArtQuest. This shows that general-domain vision-language models lack prior art knowledge and do not generalize to the specific art domain.
When fine-tuning, we tested the VIKING closed-book ArtVQA model from [12] as well as our proposed PrefixLM model in both closed-book and open-book settings. VIKING was trained for 10 epochs with batch size 512. VIKING employs an LSTM for encoding questions, ResNet-152 for encoding paintings, and Bilinear Attention Networks [18] for fusing the encoded questions and paintings. In the PrefixLM model, we used CLIP ViT-B/32 as the visual encoder, the encoder from Flan-T5-small as the language encoder, and the decoder from Flan-T5-small as the vision-language decoder. We set the maximum sequence length for the language encoder and vision-language decoder to 512 and 100, respectively. For the retrieval module, we used the text encoder from CLIP ViT-B/32 to encode the captions from SemArt and truncated the captions to 76 tokens. Subsequently, 1-Nearest Neighbor with cosine similarity was used to select the most relevant caption per painting. For the implementation, we used the Faiss library [17]. The top-1 accuracy of our retrieval module was 46%.
The overall results of the fine-tuned models are provided in Tab. 5. The VIKING model achieves about 38% accuracy with a multi-label classification approach, suggesting that treating ArtQuest with VIKING's classification approach is not effective. In contrast, the generative PrefixLM model achieves a better baseline of 50% and 53.5% exact match accuracy in closed-book and open-book settings, respectively. We hypothesise that by improving the accuracy of the retrieval module, stronger baselines can be achieved. Furthermore, Tab. 6 provides detailed accuracy and BLEU scores of the PrefixLM model for each question type. We observe that answering questions about the type of the painting is easier in comparison to the other question types in ArtQuest. In contrast, correctly answering with the titles of artworks appears to be a challenging task. This is because the paintings in the test set are unseen in the training and validation sets, and hence their titles are not specifically learned during training. The open-book setting also does not achieve a high accuracy for titles, since the top-1 performance of our retrieval module is only 46%. It is also apparent that predicting the artist is another challenging task. As shown in Fig. 4, for many of the artists in ArtQuest, there exist very few paintings. Therefore, few-shot learning approaches might be required to predict the artist more effectively. We hope that the baselines provided in this work motivate researchers to conduct further research on enhancing ArtVQA.
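Exact match accuracy, overall and per question type, can be computed as in the following minimal sketch. The normalization step (lower-casing and collapsing whitespace) is an assumption made for illustration and is not necessarily the normalization used in our evaluation; the function names and toy predictions are likewise hypothetical.

```python
from collections import defaultdict

def normalize(text):
    """Lower-case and collapse whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())

def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    hits = sum(normalize(p) == normalize(r)
               for p, r in zip(predictions, references))
    return hits / len(references)

def per_type_accuracy(items):
    """Exact match accuracy grouped by question type.

    `items` is a list of (question_type, prediction, reference) tuples.
    """
    grouped = defaultdict(lambda: ([], []))
    for qtype, pred, ref in items:
        grouped[qtype][0].append(pred)
        grouped[qtype][1].append(ref)
    return {qtype: exact_match_accuracy(preds, refs)
            for qtype, (preds, refs) in grouped.items()}

# Toy predictions (invented for illustration).
items = [
    ("school", "Italian", "italian"),
    ("technique", "oil on  canvas", "Oil on canvas"),
    ("artist", "Rembrandt", "Vermeer"),
]
overall = exact_match_accuracy([p for _, p, _ in items], [r for _, _, r in items])
by_type = per_type_accuracy(items)
```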

Conclusion
This work provides an extensive study on the current state of VQA in the art domain. We show that the only previously available benchmark dataset for ArtVQA is biased towards language priors and hence does not require considering the input image for answering questions. In order to address this problem, we propose ArtQuest as a new benchmark dataset for ArtVQA and, through extensive experiments, show that ArtQuest does not suffer from language biases.

Figure 1. Schematic architecture of our PrefixLM model. Gray parts are only active in the open-book VQA setting.
Q: When did van gogh live in the asylum of saint-paul-de-mausole? A: in the year before his death
Q: Who was one of the leading figures in neapolitan still-life painting during the baroque? A: Paolo Porpora
Q: Whose painting of the last supper is an interesting example of the baroque style of sketching? A: Maulbertsch's painting of the Last Supper

Figure 2. Knowledge question examples from AQUA where the question-answer pair is independent of the image.

Figure 4. Distribution of answers per question type.

Figure 5. Distribution of question lengths in ArtQuest.

Figure 6. Distribution of questions by their first four words.

Table 1. Accuracy of closed-book VQA on visual questions from AQUA. Yellow highlighting corresponds to experiments without images (Visual Encoder: None), while cyan denotes experiments with images.

Table 2. Accuracy of closed-book VQA on knowledge questions from AQUA. Yellow highlighting corresponds to experiments without images (Visual Encoder: None), while cyan denotes experiments with images.

Table 3. Accuracy of open-book VQA on knowledge questions from AQUA. Yellow highlighting corresponds to experiments without images (Visual Encoder: None), while cyan denotes experiments with images.

Table 4. Accuracy of open-book and closed-book VQA on ArtQuest. Yellow highlighting corresponds to experiments without images (Visual Encoder: None), while cyan denotes experiments with images.

Table 6. Accuracy and BLEU scores of VQA for each question type when using our PrefixLM. EM stands for Exact Match accuracy.