@InProceedings{Nath_2025_CVPR,
  author    = {Nath, Vishwesh and Li, Wenqi and Yang, Dong and Myronenko, Andriy and Zheng, Mingxin and Lu, Yao and Liu, Zhijian and Yin, Hongxu and Law, Yee Man and Tang, Yucheng and Guo, Pengfei and Zhao, Can and Xu, Ziyue and He, Yufan and Harmon, Stephanie and Simon, Benjamin and Heinrich, Greg and Aylward, Stephen and Edgar, Marc and Zephyr, Michael and Molchanov, Pavlo and Turkbey, Baris and Roth, Holger and Xu, Daguang},
  title     = {VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {14788-14798}
}
VILA-M3: Enhancing Vision-Language Models with Medical Expert Knowledge
Abstract
Generalist vision-language models (VLMs) have made significant strides in computer vision, but they fall short in specialized fields like healthcare, where expert knowledge is essential. Current large multimodal models such as Gemini and GPT-4o are insufficient for medical tasks because they rely on memorized internet knowledge rather than the nuanced expertise required in healthcare. Meanwhile, existing medical VLMs (e.g., Med-Gemini) often lack expert consultation as part of their design, and many rely on outdated, static datasets that were not created with modern, large deep learning models in mind. VLMs are usually trained in three stages: vision pre-training, vision-language pre-training, and instruction fine-tuning (IFT). IFT has typically been applied using a mixture of generic and healthcare data. In contrast, we propose that medical VLMs require a fourth stage of specialized IFT, which focuses on medical data and incorporates information from domain expert models. Domain expert models developed for medical use are crucial because they are trained for specific clinical tasks, e.g., detecting tumors and classifying abnormalities through segmentation and classification, and they learn fine-grained features of medical data, features that are often too intricate for a VLM to capture effectively. This paper introduces VILA-M3, a new framework for medical VLMs that utilizes domain knowledge via expert models. We argue that generic VLM architectures alone are not viable for real-world clinical applications and that on-demand use of domain-specialized expert model knowledge is critical for advancing AI in healthcare. Through our experiments, we show improved state-of-the-art (SOTA) performance, with an average improvement of ~9% over the prior SOTA model Med-Gemini and ~6% over models trained on the specific tasks. Our approach emphasizes the importance of domain expertise in creating precise, reliable VLMs for medical applications.
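The abstract describes an "on-demand" use of domain-specialized expert models: the VLM consults a task-specific expert (e.g., a segmentation or classification network) and folds its output into the generation context. The paper does not publish this interface here, so the following is a minimal illustrative sketch, not the VILA-M3 implementation; all names (`EXPERTS`, `answer_with_experts`, the stub expert functions, the keyword-based routing) are hypothetical stand-ins for the real trained models and trigger mechanism.

```python
# Hypothetical sketch of on-demand expert dispatch for a medical VLM.
# The expert functions below are stubs standing in for trained models.

def segment_tumor(image):
    """Stand-in for a trained tumor-segmentation expert model."""
    return {"finding": "lesion", "mask_area_px": 1024}

def classify_abnormality(image):
    """Stand-in for a trained abnormality-classification expert model."""
    return {"label": "cardiomegaly", "score": 0.91}

# Registry mapping task keywords to expert models (illustrative routing).
EXPERTS = {
    "segment": segment_tumor,
    "classify": classify_abnormality,
}

def answer_with_experts(vlm, image, question):
    """Consult a matching expert on demand, then prepend its structured
    output to the prompt so the VLM can ground its answer in it."""
    context = ""
    for keyword, expert in EXPERTS.items():
        if keyword in question.lower():
            context = f"[expert output: {expert(image)}] "
            break
    return vlm(context + question, image)

# Toy "VLM" that simply echoes its augmented prompt, for demonstration.
if __name__ == "__main__":
    toy_vlm = lambda prompt, image: prompt
    print(answer_with_experts(toy_vlm, None, "Please segment the tumor."))
```

The design point the sketch illustrates is that the expert runs only when the query calls for its task, so fine-grained findings the VLM cannot extract itself enter the prompt as structured evidence rather than being memorized by the generalist model.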