MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts

Qiuhui Chen, Yi Hong; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2404-2420

Abstract


Vision-language pre-training (VLP) models have proven effective in numerous computer vision applications. In this paper, we focus on developing a VLP model for the medical domain to facilitate computer-aided diagnosis (CAD) based on image scans and text descriptions from electronic health records. To achieve this, we introduce MedBLIP, a lightweight CAD system that bootstraps VLP from off-the-shelf frozen pre-trained image encoders and large language models. We incorporate a MedQFormer module to bridge the gap between 3D medical images and 2D pre-trained image encoders and language models. To evaluate the effectiveness of MedBLIP, we collected over 30,000 image volumes from five public Alzheimer's disease (AD) datasets: ADNI, NACC, OASIS, AIBL, and MIRIAD. On this large-scale AD dataset, our model demonstrates strong performance in zero-shot classification of healthy, mild cognitive impairment (MCI), and AD subjects, and also shows its capability in medical visual question answering (VQA) on the M3D-VQA-AD dataset. The code and pre-trained models are available at https://github.com/Qybc/MedBLIP.
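
The abstract describes a Q-Former-style bridging module (MedQFormer) that adapts 3D volume features to frozen 2D pre-trained encoders and a frozen language model. The following is a minimal PyTorch sketch of that general idea only; the class name `MedQFormerSketch`, all dimensions, and the specific use of learnable queries with cross-attention are illustrative assumptions and do not reproduce the authors' released implementation.

```python
# Minimal sketch (not the authors' code): a Q-Former-style bridging module that
# condenses 3D volume features into a fixed set of query embeddings sized for a
# frozen language model. Names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class MedQFormerSketch(nn.Module):
    def __init__(self, vol_feat_dim=768, lm_dim=2560, num_queries=32,
                 num_layers=2, num_heads=8):
        super().__init__()
        # Learnable query tokens that attend to the 3D image features.
        self.queries = nn.Parameter(torch.randn(1, num_queries, vol_feat_dim) * 0.02)
        layer = nn.TransformerDecoderLayer(
            d_model=vol_feat_dim, nhead=num_heads, batch_first=True
        )
        self.cross_attn = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Project query outputs into the frozen language model's embedding space.
        self.to_lm = nn.Linear(vol_feat_dim, lm_dim)

    def forward(self, volume_tokens):
        # volume_tokens: (B, N, vol_feat_dim) patch/slice embeddings obtained by
        # applying a (frozen) pre-trained image encoder to the 3D scan.
        q = self.queries.expand(volume_tokens.size(0), -1, -1)
        q = self.cross_attn(tgt=q, memory=volume_tokens)
        return self.to_lm(q)  # (B, num_queries, lm_dim) soft visual prompts


if __name__ == "__main__":
    feats = torch.randn(2, 512, 768)        # e.g. 512 tokens per 3D scan
    prompts = MedQFormerSketch()(feats)
    print(prompts.shape)                    # torch.Size([2, 32, 2560])
```

The resulting query embeddings could be prepended to the language model's input embeddings as visual prompts while both the image encoder and the language model stay frozen, which is the lightweight bootstrapping setup the abstract refers to.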

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Chen_2024_ACCV,
    author    = {Chen, Qiuhui and Hong, Yi},
    title     = {MedBLIP: Bootstrapping Language-Image Pre-training from 3D Medical Images and Texts},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2404-2420}
}