GSAM+Cutie: Text-Promptable Tool Mask Annotation for Endoscopic Video

Soberanis-Mukul, Roger D.; Cheng, Jiahuan; Mangulabnan, Jan Emily; Vedula, S. Swaroop; Ishii, Masaru; Hager, Gregory; Taylor, Russell H.; Unberath, Mathias

Roger D. Soberanis-Mukul, Jiahuan Cheng, Jan Emily Mangulabnan, S. Swaroop Vedula, Masaru Ishii, Gregory Hager, Russell H. Taylor, Mathias Unberath; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2388-2394

Abstract

Machine learning approaches for multi-view geometric scene understanding in endoscopic surgery often assume temporal consistency across frames to limit the challenges that algorithms must contend with. However, in monocular scenarios, where multiple views are acquired sequentially rather than simultaneously, the static scene assumption is too strong because surgical tools move during the procedure. To enable multi-view models despite tool motion, masking these temporally inconsistent tool regions is a feasible solution. However, manual tool masking requires a prohibitive effort, given that endoscopic video can contain thousands of frames. This underscores the need for (semi-)automated techniques to 1) automatically mask the tools and/or 2) semi-automatically annotate large datasets such that algorithms for 1) may be developed. To facilitate semi-automated annotation, any solution must be both generalizable, such that it can be used out-of-the-box on diverse datasets, and easy to use. Recent methods for surgical tool annotation require either fine-tuning on domain-specific data or excessive user interaction, limiting their application to new data. Our work introduces GSAM+Cutie, a surgical tool annotation process relying on a combination of two recent foundation models for text-based image segmentation and video object segmentation. We show that a combination of Grounded-SAM and Cutie models provides good generalization for robust text-prompt-based video-level binary segmentation on endoscopic video, streamlining the mask annotation task. Through quantitative evaluation on two datasets, including a proprietary in-house dataset and EndoVis, we show that GSAM+Cutie outperforms similar ensembles, like SAM-PT, for video object segmentation. We also discuss the limitations and future research directions that GSAM+Cutie can motivate. Our code is available at https://github.com/arcadelab/cutie_plus_gsam
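
The two-stage idea described in the abstract (a text-promptable image segmenter combined with a video object segmentation model) can be illustrated with a minimal sketch. It assumes, as one plausible arrangement, that the text-prompted model produces the mask on the first frame and the video model carries it through the remaining frames. All names here (annotate_video, mask_out_tools, and the toy_* stand-ins) are hypothetical and are not the API of Grounded-SAM, Cutie, or the released code; the toy functions only threshold pixel brightness so that the example runs without any model.

```python
from typing import Callable, List, Sequence

Frame = List[List[int]]   # toy grayscale frame, one int per pixel
Mask = List[List[bool]]   # True where a tool pixel is predicted


def annotate_video(
    frames: Sequence[Frame],
    text_prompt: str,
    segment_from_text: Callable[[Frame, str], Mask],
    propagate_mask: Callable[[Frame, Mask], Mask],
) -> List[Mask]:
    """Return one binary mask per frame.

    Assumption: the text-prompted segmenter initializes the mask on the
    first frame, and the video model propagates it frame by frame.
    """
    if not frames:
        return []
    masks = [segment_from_text(frames[0], text_prompt)]
    for frame in frames[1:]:
        masks.append(propagate_mask(frame, masks[-1]))
    return masks


def mask_out_tools(frame: Frame, mask: Mask, fill: int = 0) -> Frame:
    """Replace tool pixels with `fill` so downstream methods can ignore them."""
    return [
        [fill if is_tool else px for px, is_tool in zip(frame_row, mask_row)]
        for frame_row, mask_row in zip(frame, mask)
    ]


# --- Toy stand-ins (hypothetical, NOT real models) -------------------------

def toy_segment_from_text(frame: Frame, text_prompt: str) -> Mask:
    """Pretend that bright pixels match the prompt; the prompt is ignored."""
    return [[px > 200 for px in row] for row in frame]


def toy_propagate_mask(frame: Frame, prev_mask: Mask) -> Mask:
    """Continue tracking bright pixels only if the previous mask was non-empty."""
    if not any(any(row) for row in prev_mask):
        return [[False for _ in row] for row in frame]
    return [[px > 200 for px in row] for row in frame]


# Two 2x2 frames in which a bright "tool" pixel moves between frames.
frames = [
    [[0, 255], [0, 0]],
    [[0, 0], [255, 0]],
]
masks = annotate_video(frames, "surgical tool", toy_segment_from_text, toy_propagate_mask)
cleaned = [mask_out_tools(f, m) for f, m in zip(frames, masks)]
```

Passing the two models in as callables keeps the sketch independent of any particular library; in practice, each callable would wrap the corresponding model's own inference code.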

Related Material

[bibtex]

@InProceedings{Soberanis-Mukul_2024_CVPR,
    author    = {Soberanis-Mukul, Roger D. and Cheng, Jiahuan and Mangulabnan, Jan Emily and Vedula, S. Swaroop and Ishii, Masaru and Hager, Gregory and Taylor, Russell H. and Unberath, Mathias},
    title     = {GSAM+Cutie: Text-Promptable Tool Mask Annotation for Endoscopic Video},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2388-2394}
}