Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models

Anjishnu Mukherjee, Ziwei Zhu, Antonios Anastasopoulos; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1755-1764

Abstract


We present a comprehensive three-phase study to examine (1) the cultural understanding of Large Multimodal Models (LMMs) by introducing DALLE STREET, a large-scale dataset generated by DALL-E 3 and validated by humans, containing 9935 images of 67 countries and 10 concept classes; (2) the underlying implicit and potentially stereotypical cultural associations with a cultural artifact extraction task; and (3) an approach to adapt cultural representation in an image based on extracted associations using a modular pipeline, CULTUREADAPT. We find disparities in cultural understanding at geographic sub-region levels with both open-source (LLaVA) and closed-source (GPT-4V) models on DALLE STREET and other existing benchmarks, which we try to understand using over 18000 artifacts that we identify in association with different countries. Our findings reveal a nuanced picture of the cultural competence of LMMs, highlighting the need to develop culture-aware systems.

Related Material


@InProceedings{Mukherjee_2025_WACV,
    author    = {Mukherjee, Anjishnu and Zhu, Ziwei and Anastasopoulos, Antonios},
    title     = {Crossroads of Continents: Automated Artifact Extraction for Cultural Adaptation with Large Multimodal Models},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {1755-1764}
}