- [pdf] [supp]
Perception Matters: Detecting Perception Failures of VQA Models Using Metamorphic Testing
Visual question answering (VQA) takes an image and a natural-language question as input and returns a natural-language answer. To date, VQA models are primarily assessed by their accuracy on high-level reasoning questions. Nevertheless, Given that perception tasks (e.g., recognizing objects) are the building blocks in the compositional process required by high-level reasoning, there is a demanding need to gain insights into how much of a problem low-level perception is. Inspired by the principles of software metamorphic testing, we introduce MetaVQA, a model-agnostic framework for benchmarking perception capability of VQA models. Given an image i, MetaVQA is able to synthesize a low level perception question q. It then jointly transforms (i, q) to one or a set of sub-questions and sub-images. MetaVQA checks whether the answer to (i, q) satisfies metamorphic relationships (MRs), denoting perception consistency, with the composed answers of transformed questions and images. Violating MRs denotes a failure of answering perception questions. MetaVQA successfully detects over 4.9 million perception failures made by popular VQA models with metamorphic testing. The state-of-the-art VQA models (e.g., the champion of VQA 2020 Challenge) suffer from perception consistency problems. In contrast, the Oscar VQA models, by using anchor points to align questions and images, show generally better consistency in perception tasks. We hope MetaVQA will revitalize interest in enhancing the low-level perceptual abilities of VQA models, a cornerstone of high-level reasoning.