Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models

Shitian Zhao, Zhuowan Li, Yadong Lu, Alan Yuille, Yan Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13342-13351

Abstract


While Multi-modal Language Models (MLMs) demonstrate impressive multimodal abilities, they still struggle to provide factual and precise responses for tasks like visual question answering (VQA). In this paper, we address this challenge from the perspective of contextual information. We propose Causal Context Generation, Causal-CoG, a prompting strategy that engages contextual information to enhance precise VQA during inference. Specifically, we prompt MLMs to generate contexts, i.e., text descriptions of an image, and engage the generated contexts for question answering. Moreover, we investigate the advantage of contexts on VQA from a causality perspective, introducing causality filtering to select samples for which contextual information is helpful. To show the effectiveness of Causal-CoG, we run extensive experiments on 10 multimodal benchmarks and show consistent improvements, e.g., +6.30% on POPE, +13.69% on Vizwiz, and +6.43% on VQAv2 compared to direct decoding, surpassing existing methods. We hope Causal-CoG inspires explorations of context knowledge in multimodal models and serves as a plug-and-play strategy for MLM decoding.
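The sketch below illustrates the decoding flow described in the abstract: generate a textual context for the image, answer with and without that context, and keep the context-conditioned answer only when the context helps. It is a minimal approximation, not the paper's released implementation; the model interface (mlm_generate, answer_logprob) and the score-comparison form of the causality filtering are assumptions for illustration.

from typing import Any, Callable

def causal_cog_answer(
    image: Any,
    question: str,
    mlm_generate: Callable[..., str],      # hypothetical MLM text-generation call
    answer_logprob: Callable[..., float],  # hypothetical answer-likelihood scorer
    effect_threshold: float = 0.0,
) -> str:
    """Answer a VQA question, optionally conditioning on a generated context."""
    # Direct decoding: answer from (image, question) alone.
    direct_answer = mlm_generate(image=image, prompt=question)

    # Context generation: prompt the MLM to describe the image, then answer
    # from (image, context, question).
    context = mlm_generate(image=image, prompt="Describe this image in detail.")
    context_answer = mlm_generate(
        image=image, prompt=f"Context: {context}\nQuestion: {question}"
    )

    # Simplified stand-in for causality filtering: estimate the context's
    # effect on the answer likelihood and adopt the context-conditioned
    # answer only when that effect is positive.
    effect = answer_logprob(
        image, f"Context: {context}\nQuestion: {question}", context_answer
    ) - answer_logprob(image, question, context_answer)

    return context_answer if effect > effect_threshold else direct_answer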

Related Material


[bibtex]
@InProceedings{Zhao_2024_CVPR,
    author    = {Zhao, Shitian and Li, Zhuowan and Lu, Yadong and Yuille, Alan and Wang, Yan},
    title     = {Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13342-13351}
}