Adapting Vision-Language Models for 3D CT/MRI Understanding on PMBB via Slice Selection and Explanation Analysis
Abstract
Recent advances in vision-language models (VLMs) have enabled strong medical image understanding, yet most remain limited to 2D inputs and scale poorly to 3D CT and MRI data. Directly applying 3D encoders is computationally costly and often weakens diagnostic signals by including irrelevant slices. We propose a generalizable framework to adapt 2D medical VLMs to 3D imaging through principled slice selection and multimodal instruction tuning. Our two-stage pipeline first aligns image-text pairs with caption supervision, then fine-tunes on synthetic diagnostic conversations grounded in radiology reports. Using data from the Penn Medicine BioBank (PMBB), we benchmark energy-based and K-Center strategies across accuracy, explanation quality, and efficiency. Results show that selecting about half of the slices (e.g., K=5 of 11) yields the best diagnostic accuracy while reducing training cost by over 40%. Explanation quality, however, does not always track accuracy. Coverage and overlap analyses indicate strong complementarity across strategies, suggesting that diverse slice selection can recover most correct cases with compact subsets. These findings highlight the importance of visual-textual pretraining, task-specific fine-tuning, and principled slice selection in scaling medical VLMs to volumetric data.
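The abstract names two slice-selection strategies but, being an abstract, gives no implementation details. As a rough illustration of the K-Center idea only, the sketch below greedily picks K slices whose per-slice feature embeddings best cover the volume (greedy farthest-point sampling). The function name `k_center_slices`, the Euclidean metric, and the mean-embedding seed are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def k_center_slices(features: np.ndarray, k: int) -> list[int]:
    """Greedy K-Center (farthest-point) selection over per-slice embeddings.

    features: (num_slices, dim) array of slice feature vectors.
    k: number of slices to keep (e.g., 5 of 11 as in the paper).
    Returns sorted indices of the selected slices.
    """
    num_slices = features.shape[0]
    # Seed with the slice closest to the volume's mean embedding
    # (an assumption; any seeding rule works with the greedy scheme).
    center = features.mean(axis=0)
    first = int(np.argmin(np.linalg.norm(features - center, axis=1)))
    selected = [first]
    # Distance from every slice to its nearest already-selected slice.
    dist = np.linalg.norm(features - features[first], axis=1)
    while len(selected) < min(k, num_slices):
        # Pick the slice farthest from the current selection,
        # i.e., the one the selection covers worst.
        nxt = int(np.argmax(dist))
        selected.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(features - features[nxt], axis=1))
    return sorted(selected)

if __name__ == "__main__":
    # Toy example: pick 5 of 11 slices from random embeddings.
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(11, 256))
    print(k_center_slices(emb, k=5))
```

Greedy farthest-point selection favors mutually dissimilar slices, which is consistent with the abstract's observation that diverse, compact subsets recover most correct cases.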
Related Material
[bibtex]
@InProceedings{Chen_2025_ICCV,
  author    = {Chen, Hongzhuo and Shukla, Rahul and Wu, Ruiming and Yang, Shu and Duong-Tran, Duy and Nguyen, Duy Minh Ho and Niepert, Mathias and Beeche, Cameron and Gee, James and Duda, Jeffrey and Sharma, Rakesh and Davatzikos, Christos and Witschey, Walter and Hou, Bojian and Shen, Li},
  title     = {Adapting Vision-Language Models for 3D CT/MRI Understanding on PMBB via Slice Selection and Explanation Analysis},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {2294-2303}
}
