KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation

Huang, Feiyu; Li, Jia; Chen, Zhao; Wu, Yang; Cao, Caleb Chen; Chen, Lei

Feiyu Huang, Jia Li, Zhao Chen, Yang Wu, Caleb Chen Cao, Lei Chen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 21067-21077

Abstract

Cross-modal biomedical signals such as pathology and genomics can provide richer and more robust semantic guidance for medical image representation learning. However, the availability of such guidance remains limited, as privacy constraints and acquisition costs severely restrict access to medical images paired with other biomedical data. A further challenge lies in modality discrepancy, which introduces intra-modal statistical bias and cross-modal noise, thereby degrading the quality of medical image representations. To address these challenges, we propose KAMP, a large language model (LLM)-driven multimodal pretraining framework for medical image representation learning. KAMP leverages textual priors as the semantic anchor to enhance medical image representations and align them with multimodal biomedical representations, enabling the learning of rich and robust features even when paired data are scarce. KAMP operates in three stages. First, the LLM generates personalized diagnostic knowledge from patient clinical text and imaging metadata. This knowledge is injected as a prior to enrich medical image representations and serves as a semantic anchor to reduce the representation gap between medical images and other biomedical modalities. Second, the LLM is optimized using Group Relative Policy Optimization (GRPO), with the cross-modal aligner pretrained in the first stage serving as the reward model. Third, the refined knowledge is used to retrain the cross-modal aligner, yielding more robust medical image representations while mitigating bias and noise introduced by other modalities. Comprehensive evaluations on brain, bladder, and liver cancer datasets demonstrate that KAMP outperforms existing methods in most downstream few-shot classification tasks.
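The second stage can be illustrated with a minimal sketch of the group-relative advantage computation that GRPO is built around. The abstract does not specify how the reward is computed, so the cosine-similarity reward, the toy embeddings, and the names `cosine_similarity` and `group_relative_advantages` are all assumptions made for illustration; this is not the authors' implementation.

```python
import math
from statistics import mean, pstdev


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    GRPO scores every sampled output relative to the mean and standard
    deviation of the group it was sampled with, so no separate value
    network is required. Using the population standard deviation here
    is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical setup: one image embedding and the embeddings of three
# LLM-generated knowledge candidates, as a stand-in for scoring by a
# frozen cross-modal aligner (assumed, not taken from the paper).
image_embedding = [1.0, 0.0]
candidate_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

rewards = [cosine_similarity(image_embedding, c) for c in candidate_embeddings]
advantages = group_relative_advantages(rewards)
```

Candidates that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative advantages. When every candidate in a group earns the same reward, all advantages are zero and the group contributes no learning signal.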

Related Material

@InProceedings{Huang_2026_CVPR,
    author    = {Huang, Feiyu and Li, Jia and Chen, Zhao and Wu, Yang and Cao, Caleb Chen and Chen, Lei},
    title     = {KAMP: Knowledge-Anchored Multimodal Pretraining Framework for Medical Image Representation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {21067-21077}
}