LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation

Nisarg A. Shah, Vibashan VS, Vishal M. Patel; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12903-12913

Abstract


Referring Image Segmentation (RIS) aims to segment objects from an image based on a language description. Recent advancements have introduced transformer-based methods that leverage cross-modal dependencies, significantly enhancing performance in referring segmentation tasks. These methods are designed such that each query predicts a different mask. However, RIS inherently requires a single-mask prediction, leading to a phenomenon known as query collapse, where all queries yield the same mask prediction. This reduces the generalization capability of the RIS model for complex or novel scenarios. To address this issue, we propose a Multi-modal Query Feature Fusion technique characterized by two innovative designs: (1) Gaussian-enhanced Multi-Modal Fusion, a novel visual grounding mechanism that enhances the overall representation by extracting rich local visual information and global visual-linguistic relationships, and (2) a Dynamic Query Module that produces a diverse set of queries through a scoring network, which selectively focuses on queries for objects referred to in the language description. Moreover, we show that including an auxiliary loss to increase the distance between the mask representations of different queries further enhances performance and mitigates query collapse. Extensive experiments conducted on four benchmark datasets validate the effectiveness of our framework.
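To make the query-selection and anti-collapse ideas concrete, below is a minimal PyTorch sketch of a language-conditioned query scoring network and a pairwise diversity loss over per-query mask embeddings. All module names, tensor shapes, and the specific cosine-similarity penalty are illustrative assumptions for exposition; the paper's actual Dynamic Query Module and auxiliary loss may be defined differently.

# Illustrative sketch only -- not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryScoringNetwork(nn.Module):
    """Scores each query against a pooled language embedding (assumed design)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, queries: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        # queries: (B, Q, D) query features; lang: (B, D) pooled text feature.
        lang_exp = lang.unsqueeze(1).expand(-1, queries.size(1), -1)   # (B, Q, D)
        scores = self.score_mlp(torch.cat([queries, lang_exp], dim=-1))  # (B, Q, 1)
        weights = scores.softmax(dim=1)  # emphasize queries matching the description
        return weights * queries          # re-weighted (selected) queries


def query_diversity_loss(mask_feats: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss pushing mask representations of different queries apart.

    A plausible cosine-similarity penalty (assumed form); expects Q > 1.
    mask_feats: (B, Q, D) per-query mask embeddings.
    """
    feats = F.normalize(mask_feats, dim=-1)
    sim = feats @ feats.transpose(1, 2)                # (B, Q, Q) pairwise cosine
    q = sim.size(-1)
    off_diag = sim - torch.eye(q, device=sim.device)   # drop self-similarity
    # Penalize positive similarity between distinct queries, averaged over pairs.
    return off_diag.clamp(min=0).sum() / (sim.size(0) * q * (q - 1))


if __name__ == "__main__":
    B, Q, D = 2, 5, 256
    scorer = QueryScoringNetwork(D)
    selected = scorer(torch.randn(B, Q, D), torch.randn(B, D))  # (B, Q, D)
    loss_div = query_diversity_loss(selected)                   # scalar auxiliary term

In this sketch the diversity term would be added to the segmentation loss with a small weight, so that queries remain distinct without overriding the mask supervision.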

Related Material


[bibtex]
@InProceedings{Shah_2024_CVPR,
    author    = {Shah, Nisarg A. and VS, Vibashan and Patel, Vishal M.},
    title     = {LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {12903-12913}
}