Feature Design for Bridging SAM and CLIP toward Referring Image Segmentation
Abstract
Referring Image Segmentation (RIS) is a task aimed at segmenting objects described in natural language within an image. This task requires an understanding of the relationship between vision and language, along with precise segmentation capabilities. In the field of computer vision, CLIP and the Segment Anything Model (SAM) have gained significant attention for their classification and segmentation capabilities, respectively. Given that both models possess skills essential for RIS, combining them seems to be an effective strategy. In this paper, we propose a model that integrates CLIP and SAM to enhance RIS. Since SAM lacks classification capabilities, we develop a module that supplies the SAM mask decoder with features specifying the target object. We introduce a new module that is trained on additional instance segmentation tasks; the features derived from this module serve as inputs for the SAM decoder. With these inputs, SAM is expected to effectively segment the regions corresponding to the given natural language expressions. We conduct experiments on the traditional RefCOCO/+/g datasets as well as the recently introduced gRefCOCO and Ref-ZOM datasets, demonstrating the advantages of our approach. Code will be available at https://github.com/hitachi-rd-cv/dfam.
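As a minimal sketch of the bridging idea described above (assuming a PyTorch implementation; the module name, feature dimensions, and the linear projection design here are illustrative assumptions, not the paper's actual module), the following shows how a pooled CLIP text feature could be projected into a small set of prompt tokens that play the role of SAM's sparse prompt embeddings:

```python
import torch
import torch.nn as nn

class FeatureBridge(nn.Module):
    """Hypothetical bridge module: projects a pooled CLIP feature into
    SAM's prompt-embedding space. Dimensions (512 for CLIP ViT-B text
    features, 256 for SAM prompt tokens) and the single-linear design
    are assumptions for illustration only."""
    def __init__(self, clip_dim=512, sam_prompt_dim=256, num_tokens=4):
        super().__init__()
        self.proj = nn.Linear(clip_dim, sam_prompt_dim * num_tokens)
        self.num_tokens = num_tokens
        self.sam_prompt_dim = sam_prompt_dim

    def forward(self, clip_text_feat):
        # clip_text_feat: (B, clip_dim) pooled embedding of the
        # referring expression from CLIP's text encoder
        tokens = self.proj(clip_text_feat)
        # Reshape into a small set of sparse prompt tokens for the decoder
        return tokens.view(-1, self.num_tokens, self.sam_prompt_dim)

bridge = FeatureBridge()
text_feat = torch.randn(2, 512)    # stand-in for CLIP encode_text output
prompt_tokens = bridge(text_feat)  # shape: (2, 4, 256)
print(prompt_tokens.shape)
```

In such a design, the projected tokens would stand in for the point/box prompt embeddings that SAM's mask decoder normally receives, so that the language expression, rather than geometric prompts, specifies which object the decoder should segment.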
Related Material

[pdf] [supp]

[bibtex]
@InProceedings{Ito_2025_WACV,
    author    = {Ito, Koichiro},
    title     = {Feature Design for Bridging SAM and CLIP toward Referring Image Segmentation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {8357-8367}
}