Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts

Chen, Zhihong; Diao, Shizhe; Wang, Benyou; Li, Guanbin; Wan, Xiang

Zhihong Chen, Shizhe Diao, Benyou Wang, Guanbin Li, Xiang Wan; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 23403-23413

Abstract

Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its ability to extract generic representations from medical images and texts. In practice, there are two typical types of Med-VLP model, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former excels at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks owing to its ability to encode each modality on its own. To take advantage of both types, we propose an effective yet straightforward scheme named PTUnifier to unify them. We first unify the input format by introducing visual and textual prompts, which serve as DETR-like queries that assist in extracting features when one of the modalities is missing. By doing so, a single model can serve as a foundation model that processes various tasks with different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (rather than using static prompts) to improve diversity and scalability, enabling queries conditioned on different input instances. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering). Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to them. The source code is available at https://anonymous.4open.science/r/ICCV-2023-Submission-PTUnifier/ and will be released in the final version of this paper.
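
The idea of filling in a missing modality with prompts drawn from a pool can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`select_prompts`, `unify_input`), the mean-pooled query, the cosine-similarity ranking, and the top-k selection are all assumptions chosen for illustration, and the pools are random vectors here, whereas in a trained model they would be learned parameters.

```python
import math
import random


def mean_pool(tokens):
    """Average a list of token vectors into a single query vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_prompts(pool, present_tokens, k):
    """Hypothetical selection rule: pick the k pool entries most similar
    to the pooled features of the modality that is present."""
    query = mean_pool(present_tokens)
    ranked = sorted(range(len(pool)),
                    key=lambda i: cosine(query, pool[i]),
                    reverse=True)
    return [pool[i] for i in ranked[:k]]


def unify_input(image_tokens, text_tokens, visual_pool, textual_pool, k=2):
    """Return an (image-side, text-side) pair, substituting prompts
    for whichever modality is missing (passed as None)."""
    if image_tokens is None and text_tokens is None:
        raise ValueError("at least one modality is required")
    if image_tokens is None:
        image_tokens = select_prompts(visual_pool, text_tokens, k)
    if text_tokens is None:
        text_tokens = select_prompts(textual_pool, image_tokens, k)
    return image_tokens, text_tokens


# Toy data: random stand-ins for learned prompt pools and encoded text tokens.
rng = random.Random(0)
dim, pool_size = 4, 8
visual_pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pool_size)]
textual_pool = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(pool_size)]
text_only = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

# Text-only input: the image side is filled with k visual prompts.
image_side, text_side = unify_input(None, text_only, visual_pool, textual_pool, k=2)
```

Because the selection depends on the features of the input that is present, two different text-only inputs can receive different visual prompts, which is the sense in which a pool allows queries conditioned on the input instance, in contrast to a single static prompt.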

Related Material

[bibtex]

@InProceedings{Chen_2023_ICCV,
    author    = {Chen, Zhihong and Diao, Shizhe and Wang, Benyou and Li, Guanbin and Wan, Xiang},
    title     = {Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {23403-23413}
}