@InProceedings{Mukherjee_2025_WACV,
    author    = {Mukherjee, Anjishnu and Zhu, Ziwei and Anastasopoulos, Antonios},
    title     = {Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {1755-1764}
}
Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models
Abstract
We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DALLE STREET, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9,935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CULTUREADAPT. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DALLE STREET and other existing benchmarks, which we try to understand using over 18,000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.