Unsupervised Domain Adaptive Visual Question Answering in the Era of Multi-Modal Large Language Models

Weixi Weng, Rui Zhang, Xiaojun Meng, Jieming Zhu, Qun Liu, Chun Yuan; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6248-6258

Abstract


Unsupervised domain adaptation (UDA) for visual question answering (VQA) has attracted research interest. However, with Multi-modal Large Language Models (MLLMs) showing strong performance on VQA datasets, UDA for VQA based on MLLMs remains unexplored. To fill this gap, we propose the first systematic approach to Unsupervised Domain Adaptation VQA based on MLLMs (UDAM). First, we introduce semantic context feature alignment and domain query feature alignment, which utilize a single token embedding for each modality to capture contextual domain information from unimodal inputs and conduct coarse-grained feature alignment on it, thus alleviating domain shifts in the unimodal feature space. Second, we propose the novel semantics-guided query feature alignment, which differentiates important domain-specific queries from learnable query outputs and conducts fine-grained feature alignment controlled by a semantics-guided weight map, reducing domain shifts in the cross-modal feature space. Third, we devise a pair-wise domain-aware prompt strategy, which aids UDA by prompting MLLMs to discern the commonality of tasks and the distinctiveness of domains in multi-modal inputs. Extensive experiments demonstrate UDAM's effectiveness in adapting MLLMs to unlabeled new domains.
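The abstract does not specify the exact alignment losses, so the following is only a minimal PyTorch sketch of the two ideas it describes: coarse-grained alignment of a single per-modality context token, and fine-grained alignment of learnable query outputs weighted by a semantics-guided map. The function names, the CORAL-style moment matching, and the cosine-similarity weight map are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of UDAM-style alignment losses; all names and loss
# choices here are assumptions made for illustration, not the paper's code.
import torch
import torch.nn.functional as F


def context_token_alignment(src_tok: torch.Tensor, tgt_tok: torch.Tensor) -> torch.Tensor:
    """Coarse-grained alignment between single context-token embeddings.

    src_tok, tgt_tok: (batch, dim) domain-context tokens from the source
    and target domains. As a proxy for the paper's unspecified loss, we
    match first and second moments (CORAL-style).
    """
    mean_loss = (src_tok.mean(0) - tgt_tok.mean(0)).pow(2).sum()
    src_c = src_tok - src_tok.mean(0)
    tgt_c = tgt_tok - tgt_tok.mean(0)
    cov_src = src_c.T @ src_c / (src_tok.size(0) - 1)
    cov_tgt = tgt_c.T @ tgt_c / (tgt_tok.size(0) - 1)
    return mean_loss + (cov_src - cov_tgt).pow(2).mean()


def semantics_weighted_query_alignment(
    src_q: torch.Tensor, tgt_q: torch.Tensor,
    src_sem: torch.Tensor, tgt_sem: torch.Tensor,
) -> torch.Tensor:
    """Fine-grained alignment over learnable query outputs.

    src_q, tgt_q: (batch, num_queries, dim) query features from the
    cross-modal module. src_sem, tgt_sem: (batch, dim) semantic context
    tokens. Cosine similarity to the semantic token stands in for the
    semantics-guided weight map, emphasizing domain-relevant queries.
    """
    w_src = F.softmax(F.cosine_similarity(src_q, src_sem.unsqueeze(1), dim=-1), dim=-1)
    w_tgt = F.softmax(F.cosine_similarity(tgt_q, tgt_sem.unsqueeze(1), dim=-1), dim=-1)
    src_pooled = (w_src.unsqueeze(-1) * src_q).sum(1)  # (batch, dim)
    tgt_pooled = (w_tgt.unsqueeze(-1) * tgt_q).sum(1)
    return (src_pooled - tgt_pooled).pow(2).mean()
```

In a training loop, these two terms would be added to the MLLM's answer-generation loss on labeled source data, with unlabeled target batches contributing only through the alignment losses.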

Related Material


[bibtex]
@InProceedings{Weng_2025_WACV,
    author    = {Weng, Weixi and Zhang, Rui and Meng, Xiaojun and Zhu, Jieming and Liu, Qun and Yuan, Chun},
    title     = {Unsupervised Domain Adaptive Visual Question Answering in the Era of Multi-Modal Large Language Models},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6248-6258}
}