T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency

Liu, Che; Ouyang, Cheng; Chen, Yinda; Quilodrán-Casas, César; Ma, Lei; Fu, Jie; Guo, Yike; Shah, Anand; Bai, Wenjia; Arcucci, Rossella

Che Liu, Cheng Ouyang, Yinda Chen, César Quilodrán-Casas, Lei Ma, Jie Fu, Yike Guo, Anand Shah, Wenjia Bai, Rossella Arcucci; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 6763-6773

Abstract

While 3D visual self-supervised learning (vSSL) shows promising results in capturing visual representations, it overlooks the clinical knowledge from radiology reports. Meanwhile, 3D medical vision-language pre-training (MedVLP) remains underexplored due to the lack of a large-scale, publicly available 3D medical image-report dataset. To bridge this gap, we introduce CT-3DVLP, the first and largest public 3D volume-report dataset, establishing a comprehensive benchmark for 3D MedVLP research. We also propose the T3D framework, which enhances 3D MedVLP beyond naive CLIP-style alignment, which directly pairs volumes with reports but neglects local visual representations. Specifically, we introduce Text-informed Multi-view Alignment (TMA), a novel approach that clusters volumetric data while enforcing consistency across different views of the same volume-report pair. TMA integrates textual features into fine-grained visual representations, ensuring contextual coherence across views. We evaluate T3D across multiple downstream tasks in both unimodal and cross-modal settings, including zero-shot and fine-tuned classification, cross-modal retrieval, report generation, and semantic segmentation. Our results show that T3D consistently outperforms existing vSSL and multimodal methods, demonstrating superior zero-shot and fine-tuning capabilities and setting a new benchmark for 3D medical image understanding.
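
For readers unfamiliar with the "naive CLIP-style alignment" that the abstract contrasts T3D against, the following is a minimal sketch of a symmetric contrastive (InfoNCE) loss over paired volume and report embeddings. It illustrates the baseline only, not T3D or TMA. The function name, the temperature value, and the assumption that each volume and report has already been encoded to a single global vector are illustrative choices, not details taken from the paper.

```python
import numpy as np

def clip_style_loss(volume_emb, report_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired global embeddings.

    volume_emb, report_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same volume-report pair.
    """
    # L2-normalise so dot products are cosine similarities.
    v = volume_emb / np.linalg.norm(volume_emb, axis=1, keepdims=True)
    t = report_emb / np.linalg.norm(report_emb, axis=1, keepdims=True)

    # (batch, batch) similarity matrix; the diagonal holds matched pairs.
    logits = v @ t.T / temperature

    def cross_entropy_diag(z):
        # Numerically stable log-softmax over each row, then take the
        # log-probability assigned to the matched (diagonal) entry.
        z = z - z.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the volume-to-report and report-to-volume directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Because this loss compares only one global vector per volume with one per report, it has no term that constrains local or per-view features, which is the limitation the abstract attributes to this style of alignment.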

Related Material

[bibtex]

@InProceedings{Liu_2025_ICCV,
    author    = {Liu, Che and Ouyang, Cheng and Chen, Yinda and Quilodr\'an-Casas, C\'esar and Ma, Lei and Fu, Jie and Guo, Yike and Shah, Anand and Bai, Wenjia and Arcucci, Rossella},
    title     = {T3D: Advancing 3D Medical Vision-Language Pre-training by Learning Multi-View Visual Consistency},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {6763-6773}
}