MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval
Abstract
Recent advances in medical vision-language tasks, such as Medical Visual Question Answering (Med-VQA) and Medical Image-Text Retrieval (Med-ITR), aim to learn jointly from images and texts. However, two main issues persist in the field: the neglect of multi-view medical images and incomplete cross-modality understanding. Current studies often treat each image-text pair as an independent instance (i.e., at the instance level), neglecting the contextual information available from multi-view images of the same study. Although some methods have explored refined alignment, combining global representation alignment with token-wise alignment of local representations, they often rely on only a uni-modality encoder (e.g., the visual encoder) for downstream applications and thus lack comprehensive cross-modality understanding. To address these issues, this paper introduces MVCM, a framework that supports Multi-View and Cross-Modality alignment for Med-VQA and Med-ITR. Our method fully utilizes the multi-view images in radiology datasets and aligns them at the study level. We also employ various pretext tasks to support cross-modality alignment. We fine-tune the proposed model on the downstream Med-VQA and Med-ITR tasks, outperforming state-of-the-art methods across multiple datasets. The code will be publicly available.
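The abstract does not specify implementation details, so the following is a minimal sketch, not the authors' code, of one way study-level alignment of multi-view images with a paired report could be formulated: masked mean pooling over per-view embeddings into a single study embedding, followed by a symmetric InfoNCE loss against the report embedding. The function name study_level_contrastive_loss, the pooling choice, and the temperature value are assumptions for illustration; per-view and report embeddings are assumed to come from separate image and text encoders.

    # Minimal sketch (assumption, not the authors' implementation): study-level
    # image-report contrastive alignment with masked mean pooling over views.
    import torch
    import torch.nn.functional as F


    def study_level_contrastive_loss(view_embeddings, view_mask, report_embeddings,
                                     temperature=0.07):
        """Align each study (a set of views) with its paired report.

        view_embeddings:   (B, V, D) per-view image embeddings, zero-padded to V views
        view_mask:         (B, V)    1 for real views, 0 for padding
        report_embeddings: (B, D)    one text embedding per study
        """
        # Pool the views of each study into a single study-level embedding
        # (masked mean pooling; attention pooling would be another choice).
        mask = view_mask.unsqueeze(-1)                       # (B, V, 1)
        study_emb = (view_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-6)

        # Normalize both modalities before computing cosine similarities.
        study_emb = F.normalize(study_emb, dim=-1)           # (B, D)
        report_emb = F.normalize(report_embeddings, dim=-1)  # (B, D)

        # Symmetric InfoNCE: matched study/report pairs are positives,
        # all other pairs in the batch serve as negatives.
        logits = study_emb @ report_emb.t() / temperature    # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)


    # Example with random tensors: 4 studies, up to 3 views each, 256-d embeddings.
    if __name__ == "__main__":
        B, V, D = 4, 3, 256
        views = torch.randn(B, V, D)
        mask = torch.tensor([[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 1, 0]],
                            dtype=torch.float)
        reports = torch.randn(B, D)
        print(study_level_contrastive_loss(views, mask, reports))

Attention-based pooling over views, or pairing each view with the report separately, would be alternative design choices; the abstract only states that multi-view images of the same study are aligned at the study level, and that additional pretext tasks support cross-modality alignment.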
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Zou_2025_CVPR,
  author    = {Zou, Yuanhao and Yin, Zhaozheng},
  title     = {MVCM: Enhancing Multi-View and Cross-Modality Alignment for Medical Visual Question Answering and Medical Image-Text Retrieval},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {180-190}
}