BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts

Md Fahim, Md Sakib Ul Rahman, Akm Moshiur Rahman, Md Farhan Ishmam, Md Tasmim Rahman, Fariha Tanjim Shifat, Fabiha Haider, Md Farhad Alam Bhuiyan; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 1159-1169

Abstract


The advanced multimodal processing of current vision language models (VLMs) has prompted rigorous benchmarking in multicultural settings, revealing a clear inclination toward Western culture. While this bias likely stems from the predominance of Western-centric images in VLM pretraining data, the resulting long-tail distribution problem is only exacerbated in underrepresented cultural settings, such as Bengali culture. Our work explores this problem through an aspect-based evaluation of several classes of VLMs on the rich Bengali culture. Our BanglaProtha dataset is a visual question answering (VQA) dataset containing images that encapsulate Bengali cultural elements, questions in native Bengali, and semantically similar multiple-choice answer options. Our experiments provide behavioral insights into VLMs across prompting and fine-tuning strategies, cultural aspects, model size, and augmentation methods. Our work serves as a diagnostic tool for addressing and mitigating inequalities in multicultural and multilingual settings, thereby contributing to efforts to democratize AI systems. Our code and data are available at https://github.com/farhanishmam/BanglaProtha.
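An aspect-based evaluation of a multiple-choice VQA benchmark reduces, at its simplest, to grouping per-item correctness by a cultural-aspect label. The sketch below illustrates that idea only. The `MCQItem` record, its field names, the aspect labels ("festival", "food"), and the `per_aspect_accuracy` helper are all hypothetical and are not the BanglaProtha release format or the authors' evaluation code; see the linked repository for the actual data and scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MCQItem:
    """Hypothetical record for one multiple-choice VQA sample (not the official schema)."""
    image_path: str     # image depicting a Bengali cultural element
    question: str       # question text in native Bengali
    options: list[str]  # semantically similar answer options
    answer_idx: int     # index of the correct option in `options`
    aspect: str         # cultural aspect label (hypothetical taxonomy)


def per_aspect_accuracy(items, predictions):
    """Return {aspect: accuracy}, given predicted option indices aligned with `items`."""
    if len(items) != len(predictions):
        raise ValueError("items and predictions must have the same length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        total[item.aspect] += 1
        correct[item.aspect] += int(pred == item.answer_idx)
    # Accuracy per aspect = correct answers / questions for that aspect.
    return {aspect: correct[aspect] / total[aspect] for aspect in total}


# Toy example with made-up items; predictions would come from a VLM.
items = [
    MCQItem("img1.jpg", "এটি কোন উৎসব?", ["পহেলা বৈশাখ", "ঈদ", "দুর্গাপূজা", "নবান্ন"], 0, "festival"),
    MCQItem("img2.jpg", "এটি কোন খাবার?", ["পান্তা ভাত", "বিরিয়ানি", "খিচুড়ি", "ভর্তা"], 0, "food"),
    MCQItem("img3.jpg", "এটি কোন উৎসব?", ["পহেলা বৈশাখ", "ঈদ", "দুর্গাপূজা", "নবান্ন"], 2, "festival"),
]
predictions = [0, 1, 2]
print(per_aspect_accuracy(items, predictions))  # {'festival': 1.0, 'food': 0.0}
```

Reporting accuracy per aspect rather than as one pooled number makes it visible when a model does well on some cultural categories but fails on others, which a single overall score would hide.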

Related Material


@InProceedings{Fahim_2026_WACV,
    author    = {Fahim, Md and Rahman, Md Sakib Ul and Rahman, Akm Moshiur and Ishmam, Md Farhan and Rahman, Md Tasmim and Shifat, Fariha Tanjim and Haider, Fabiha and Alam Bhuiyan, Md Farhad},
    title     = {BanglaProtha: Evaluating Vision Language Models in Underrepresented Long-tail Cultural Contexts},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {1159-1169}
}