Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency

Jungbeom Lee, Sungjin Lee, Jinseok Nam, Seunghak Yu, Jaeyoung Do, Tara Taghavi; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21870-21881

Abstract


Referring image segmentation (RIS) aims to localize the object in an image that is referred to by a natural language expression. Most previous studies learn RIS from large-scale datasets containing segmentation labels, but such labels are costly to obtain. We present a weakly supervised learning method for RIS that uses only readily available image-text pairs. We first train a visual-linguistic model for image-text matching and extract a visual saliency map through Grad-CAM to identify the image regions corresponding to each word. However, we found two major problems with Grad-CAM. First, it does not consider the critical semantic relationships between words. We tackle this problem by modeling the relationships between words through intra-chunk and inter-chunk consistency. Second, Grad-CAM identifies only small regions of the referred object, leading to low recall. We therefore refine the localization maps using the self-attention in the Transformer and an unsupervised object shape prior. On three popular benchmarks (RefCOCO, RefCOCO+, G-Ref), our method significantly outperforms recent comparable techniques. We also show that our method is applicable to various levels of supervision and obtains better performance than recent methods.
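To make the first step of the pipeline concrete, the sketch below extracts a word-level Grad-CAM map from a toy image-text matching model in PyTorch. It is a minimal illustration of the general technique only: the architecture, the dot-product matching head, and all names (ToyMatcher, word_grad_cam) are assumptions for this example, not the authors' implementation.

# Minimal sketch: word-level Grad-CAM from an image-text matching score.
# The model below is a toy stand-in, not the paper's visual-linguistic model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMatcher(nn.Module):
    """Toy visual-linguistic model: a small CNN image encoder plus a word
    embedding table, scored by a dot product (illustrative only)."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.word_emb = nn.Embedding(vocab_size, dim)

    def forward(self, image, word_id):
        feats = self.backbone(image)        # (B, dim, H, W) conv features
        pooled = feats.mean(dim=(2, 3))     # global average pooling
        w = self.word_emb(word_id)          # (B, dim) word embedding
        score = (pooled * w).sum(dim=1)     # image-word matching score
        return score, feats

def word_grad_cam(model, image, word_id):
    """Grad-CAM of the matching score w.r.t. the last conv features,
    yielding a localization map for the given word."""
    score, feats = model(image, word_id)
    feats.retain_grad()                     # keep gradient of a non-leaf tensor
    score.sum().backward()
    weights = feats.grad.mean(dim=(2, 3), keepdim=True)  # channel importance
    cam = F.relu((weights * feats).sum(dim=1))           # (B, H, W) map
    cam = cam / (cam.amax(dim=(1, 2), keepdim=True) + 1e-8)
    return cam

model = ToyMatcher()
img = torch.randn(1, 3, 224, 224)
cam = word_grad_cam(model, img, torch.tensor([42]))
print(cam.shape)  # torch.Size([1, 56, 56])

As the abstract notes, a raw per-word map like this is both relation-blind and low-recall; the paper's chunk-consistency terms and self-attention refinement operate on top of such maps.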

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Lee_2023_ICCV,
    author    = {Lee, Jungbeom and Lee, Sungjin and Nam, Jinseok and Yu, Seunghak and Do, Jaeyoung and Taghavi, Tara},
    title     = {Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21870-21881}
}