What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-Modal Language Models

Zhang, Letian; Zhai, Xiaotong; Zhao, Zhongkai; Wen, Xin; Zhao, Bingchen

Letian Zhang, Xiaotong Zhai, Zhongkai Zhao, Xin Wen, Bingchen Zhao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 4629-4633

Abstract

Counterfactual reasoning is one of the core abilities of human intelligence. It involves processing alternatives to observed states or past events, and it can improve our ability to plan and make decisions. In this work, we focus on benchmarking the counterfactual reasoning ability of multi-modal large language models. We take question-answer pairs from the VQAv2 dataset and add one counterfactual presupposition to each question, modifying the answer accordingly. After generating counterfactual questions and answers using ChatGPT, we manually examine all generated questions and answers to ensure correctness. This results in over 2k counterfactual question-answer pairs. We evaluate recent vision-language models on our newly collected test dataset and find that all models exhibit a large performance drop compared to when they are tested on questions without a counterfactual presupposition. This result indicates that there is still room for improvement in vision-language models. We hope our proposed benchmark will help the development of future systems.

Related Material

[bibtex]

@InProceedings{Zhang_2023_ICCV,
    author    = {Zhang, Letian and Zhai, Xiaotong and Zhao, Zhongkai and Wen, Xin and Zhao, Bingchen},
    title     = {What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-Modal Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {4629--4633}
}