Foundational Multi-Task Multimodal Model for Upper GI Endoscopy

He, Yuxuan; Chen, Qilei; Liu, Benyuan; Cao, Yu

Yuxuan He, Qilei Chen, Benyuan Liu, Yu Cao; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 6671-6680

Abstract

Upper gastrointestinal (GI) diseases are a major health threat, especially gastric cancer, which is the 5th most common cancer worldwide and the 3rd most common cause of cancer death. In the U.S., around 62 million people are diagnosed with digestive disorders annually. Gastroscopy is currently the only widely recognized effective method for screening and treatment. Although many machine learning-based techniques exist for computer-aided detection and diagnosis, multimodal models that integrate both textual and visual information are scarce. In this paper, we present a foundational multimodal framework for upper GI endoscopy, adapting recent vision-language model architectures. For the first time, we apply large-scale multimodal modeling to upper GI endoscopy, a domain that has received less attention in prior medical imaging research. Based on the four downstream tasks established in previous work, our framework integrates both public and internal expert-annotated GI endoscopy datasets. Through a unified instruction-based fine-tuning pipeline, the framework supports most upper GI diagnostic, detection, and reporting scenarios. It not only covers clinical tasks such as anatomical recognition, lesion diagnosis, and lesion localization, but also enables natural language interaction with clinicians, greatly enhancing the model's practicality and clinical value. Experimental results show that the framework can generalize and perform across various GI endoscopy tasks. It handles a variety of medical tasks and offers report summarization capabilities that meet the needs of medical experts. Although there is still room for improvement on some tasks, the framework lays the foundation for further optimization and wider application.

Related Material

[bibtex]

@InProceedings{He_2025_ICCV,
    author    = {He, Yuxuan and Chen, Qilei and Liu, Benyuan and Cao, Yu},
    title     = {Foundational Multi-Task Multimodal Model for Upper GI Endoscopy},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {6671-6680}
}