Mask Grounding for Referring Image Segmentation

Yong Xien Chng, Henry Zheng, Yizeng Han, Xuchong Qiu, Gao Huang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26573-26583

Abstract


Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred to by free-form language expressions. Despite significant progress in recent years, most state-of-the-art (SOTA) methods still suffer from a considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently, they exhibit weak object-level correspondence between visual and language features. Without well-grounded features, prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects, especially when dealing with rarely used or ambiguous clauses. To tackle this challenge, we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be applied directly to prior RIS methods and consistently brings improvements. Furthermore, to holistically address the modality gap, we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques, our comprehensive approach culminates in MagNet (Mask-grounded Network), an architecture that significantly outperforms prior art on three key benchmarks (RefCOCO, RefCOCO+, and G-Ref), demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released.
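
The masked-token idea behind an auxiliary task of this kind can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify the mask ratio, the mask token, or the predictor, so `MASK`, `mask_tokens`, `masked_token_loss`, and the default `mask_ratio` below are all hypothetical. The visual side is omitted entirely; the per-position probabilities passed to the loss stand in for whatever a predictor conditioned on visual features would output.

```python
import math
import random

MASK = "[MASK]"  # hypothetical mask token, not taken from the paper


def mask_tokens(tokens, mask_ratio=0.15, rng=None):
    """Replace a random subset of tokens with MASK.

    Returns the masked token list and the sorted masked positions.
    mask_ratio is an assumed value in [0, 1]; at least one token is
    masked whenever the expression is non-empty.
    """
    if not tokens:
        return [], []
    rng = rng or random.Random()
    n_mask = max(1, round(len(tokens) * mask_ratio))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK if i in positions else t for i, t in enumerate(tokens)]
    return masked, positions


def masked_token_loss(probs, targets, positions):
    """Mean cross-entropy computed over the masked positions only.

    probs[i] is a dict mapping a vocabulary word to the probability a
    (hypothetical) predictor assigns to it at position i.
    """
    losses = [
        -math.log(max(probs[i].get(targets[i], 0.0), 1e-12))
        for i in positions
    ]
    return sum(losses) / len(losses)


# Usage: mask a referring expression, then score stand-in predictions.
tokens = "the man in the red shirt".split()
masked, positions = mask_tokens(tokens, mask_ratio=0.3, rng=random.Random(0))
uniform = [{w: 0.25 for w in ("man", "red", "shirt", "the")} for _ in tokens]
loss = masked_token_loss(uniform, tokens, positions)
```

Only the masked positions contribute to the loss, so the supervision signal targets exactly the words the model has to recover from context. In a full system, that context would include the image features.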

Related Material


[bibtex]
@InProceedings{Chng_2024_CVPR,
    author    = {Chng, Yong Xien and Zheng, Henry and Han, Yizeng and Qiu, Xuchong and Huang, Gao},
    title     = {Mask Grounding for Referring Image Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26573-26583}
}