@InProceedings{La_2024_ACCV,
    author    = {La, Thang and Tran, Minh-Hanh and Dao, Viet-Hang and Tran, Thanh-Hai},
    title     = {LViTES: Leveraging vision and text for enhancing segmentation of endoscopic images},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2024},
    pages     = {511-524}
}
LViTES: Leveraging vision and text for enhancing segmentation of endoscopic images
Abstract
Automatic lesion segmentation in endoscopic images is crucial for mitigating the risk of missed lesions during analysis, particularly for inexperienced physicians or in situations of medical overload. Traditional segmentation models rely predominantly on pixel-level labeled images and often neglect auxiliary information such as physicians' diagnostic conclusions. This study proposes a novel approach that harnesses available lesion information--including segmentation regions, physician conclusions, and supplementary disease descriptions--to improve segmentation efficacy. Our method builds upon the successful integration of CNN and Vision Transformer architectures in the LViT model, originally designed for lung cancer lesion segmentation from X-ray images using dual inputs: images and text. We propose a new framework, called LViTES, with four key advancements: 1) optimizing the LViT architecture to enhance image feature extraction by incorporating the EfficientNet backbone and integrating Cross-Attention, while also reducing model complexity and parameter count; 2) addressing the scarcity of textual descriptions in current datasets by developing a module that generates text from segmentation masks based on attributes such as shape, location, size, and quantity; 3) incorporating both image and text inputs during training while allowing adaptive prediction with only image inputs, to align with typical use cases; and 4) evaluating model performance using both generated text and physician-provided descriptions. The effectiveness of our approach is validated on three types of lesions--gastric cancer, esophageal cancer (our self-collected datasets), and polyps (Kvasir-SEG dataset)--demonstrating superior performance compared to state-of-the-art methods.
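To make the second advancement concrete, the following is a minimal illustrative sketch of how a text description might be derived from a binary segmentation mask using the four attributes the abstract names (shape, location, size, quantity). It is not the authors' module: the function names `connected_components` and `describe_mask`, the size thresholds (5% and 20% of the image area), the aspect-ratio cutoff of 1.5, the quadrant-based location wording, and the output phrasing are all assumptions chosen for illustration.

```python
from collections import deque

import numpy as np


def connected_components(mask):
    """Return one (N, 2) array of (row, col) pixels per 4-connected foreground region."""
    mask = np.asarray(mask, dtype=bool)
    seen = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    regions = []
    # np.nonzero scans in row-major order, so regions are found top to bottom.
    for r, c in zip(*np.nonzero(mask)):
        if seen[r, c]:
            continue
        queue = deque([(r, c)])
        seen[r, c] = True
        pixels = []
        while queue:  # breadth-first flood fill of one region
            y, x = queue.popleft()
            pixels.append((y, x))
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not seen[ny, nx]:
                    seen[ny, nx] = True
                    queue.append((ny, nx))
        regions.append(np.array(pixels))
    return regions


def describe_mask(mask):
    """Build a short description from quantity, size, shape, and location (hypothetical wording)."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    regions = connected_components(mask)
    if not regions:
        return "no lesion"
    parts = []
    for px in regions:
        ys, xs = px[:, 0], px[:, 1]
        # Size: fraction of the image covered (thresholds are assumptions).
        area_frac = len(px) / (h * w)
        size = "small" if area_frac < 0.05 else "medium" if area_frac < 0.20 else "large"
        # Shape: bounding-box aspect ratio as a crude proxy (assumption).
        box_h = ys.max() - ys.min() + 1
        box_w = xs.max() - xs.min() + 1
        shape = "round" if max(box_h, box_w) / min(box_h, box_w) < 1.5 else "elongated"
        # Location: image quadrant containing the region's centroid.
        vert = "upper" if ys.mean() < h / 2 else "lower"
        horiz = "left" if xs.mean() < w / 2 else "right"
        parts.append(f"{size} {shape} lesion in the {vert} {horiz}")
    noun = "lesion" if len(regions) == 1 else "lesions"
    return f"{len(regions)} {noun}: " + "; ".join(parts)


if __name__ == "__main__":
    demo = np.zeros((10, 10), dtype=bool)
    demo[1:3, 1:3] = True   # small square region near the top-left corner
    demo[6:8, 4:10] = True  # wider region in the lower right
    print(describe_mask(demo))
```

A rule-based generator of this kind needs no extra annotation beyond the masks already used for training, which is the scarcity problem the abstract describes; a real system would likely use richer shape descriptors and clinically meaningful location terms.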
Related Material