Evaluating Multimodal Large Language Models Across Distribution Shifts and Augmentations

Aayush Atul Verma, Amir Saeidi, Shamanthak Hegde, Ajay Therala, Fenil Denish Bardoliya, Nagaraju Machavarapu, Shri Ajay Kumar Ravindhiran, Srija Malyala, Agneet Chatterjee, Yezhou Yang, Chitta Baral; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 5314-5324

Abstract


Foundation models such as Multimodal Large Language Models (MLLMs), with their ability to interpret images and generate intricate responses, have seen widespread adoption across computer vision and natural language processing tasks. However, they suffer from hallucinations and struggle with complex reasoning. In this work, we evaluate the performance of MLLMs under multiple multimodal augmentations and in out-of-distribution settings. We benchmark three models on two vision-language datasets, VQAv2 and CLEVR, and assess their performance under adversarial transformations in both the vision and language modalities. We introduce image perturbations using various augmentations, including noise addition, blurring, and median filtering, and generate adversarial questions containing conjunctions, disjunctions, and negations. Additionally, we conduct a detailed fine-grained analysis of model performance on particular question categories, such as those related to shape and color, across images featuring identical or varying objects. Our findings indicate a notable decrease in the performance of current MLLMs on synthetic images, with a gradual decline observed across both vision and language augmentations. Specifically, Gaussian noise addition emerges as the most detrimental augmentation, and we observe a significant drop in performance on complex questions containing multiple connectives. At a time of rapid development and deployment of MLLMs in real-world settings, we believe our findings are a first step towards benchmarking the robustness and out-of-distribution behavior of such models.
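As a rough illustration of the three image perturbations named in the abstract, the sketch below applies Gaussian noise addition, Gaussian blurring, and median filtering using NumPy and Pillow. This is a minimal sketch, not the paper's exact pipeline: the noise scale (sigma) and filter sizes are illustrative assumptions, not parameters reported by the authors.

```python
# Illustrative sketch of the three image perturbations named in the
# abstract; sigma, radius, and size values are assumptions, not the
# paper's reported settings.
import numpy as np
from PIL import Image, ImageFilter

def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def gaussian_blur(img: Image.Image, radius: float = 2.0) -> Image.Image:
    """Blur with a Gaussian kernel of the given radius."""
    return img.filter(ImageFilter.GaussianBlur(radius=radius))

def median_filter(img: Image.Image, size: int = 5) -> Image.Image:
    """Replace each pixel with the median of its size x size neighborhood."""
    return img.filter(ImageFilter.MedianFilter(size=size))

if __name__ == "__main__":
    # "example.jpg" is a placeholder input image.
    img = Image.open("example.jpg").convert("RGB")
    for name, fn in [("noise", gaussian_noise),
                     ("blur", gaussian_blur),
                     ("median", median_filter)]:
        fn(img).save(f"example_{name}.jpg")
```

Perturbed copies of each benchmark image produced this way can then be fed to an MLLM alongside the original question to compare accuracy against the clean-image baseline; the language-side augmentations (conjunctions, disjunctions, negations) are applied analogously to the questions rather than the images.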

Related Material


[bibtex]
@InProceedings{Verma_2024_CVPR,
    author    = {Verma, Aayush Atul and Saeidi, Amir and Hegde, Shamanthak and Therala, Ajay and Bardoliya, Fenil Denish and Machavarapu, Nagaraju and Ravindhiran, Shri Ajay Kumar and Malyala, Srija and Chatterjee, Agneet and Yang, Yezhou and Baral, Chitta},
    title     = {Evaluating Multimodal Large Language Models Across Distribution Shifts and Augmentations},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {5314-5324}
}