Enhancing Anchor-based Weakly Supervised Referring Expression Comprehension with Cross-Modality Attention

Ting-Yu Chu, Yong-Xiang Lin, Ching-Chun Huang, Kai-Lung Hua; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 2767-2783

Abstract


Weakly supervised Referring Expression Comprehension (REC) tackles the challenge of identifying specific regions in an image based on textual descriptions, without predefined mappings between the text and target objects during training. The primary obstacle lies in the misalignment between visual and textual features, which often results in inaccurate bounding box predictions. To address this, we propose a novel cross-modality attention (CMA) module that harmonizes textual and visual features, enhancing the discriminative power of grid features and improving localization accuracy. To handle the label noise that is common in weak supervision, we also introduce a false negative suppression mechanism that uses intra-modal similarities as soft supervision signals. Extensive experiments on four REC benchmark datasets, RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, show that our model consistently outperforms state-of-the-art methods in accuracy and generalizability.
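To make the abstract's two ideas concrete, below is a minimal PyTorch sketch of (a) a cross-attention block in which visual grid features query textual token features, and (b) soft targets built from intra-modal similarities for suppressing false negatives. All names, dimensions, and design details here (GridTextCrossAttention, soft_negative_targets, the residual fusion) are illustrative assumptions, not the authors' implementation; consult the paper's PDF for the actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridTextCrossAttention(nn.Module):
    """Sketch of a CMA-style block: grid features (queries) attend to text tokens."""
    def __init__(self, vis_dim=256, txt_dim=768, embed_dim=256, num_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(vis_dim, embed_dim)   # queries from visual grid
        self.k_proj = nn.Linear(txt_dim, embed_dim)   # keys from text tokens
        self.v_proj = nn.Linear(txt_dim, embed_dim)   # values from text tokens
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, grid_feats, text_feats, text_pad_mask=None):
        # grid_feats: (B, H*W, vis_dim); text_feats: (B, T, txt_dim)
        # text_pad_mask: (B, T) bool, True where a token is padding.
        q = self.q_proj(grid_feats)
        k = self.k_proj(text_feats)
        v = self.v_proj(text_feats)
        attended, _ = self.attn(q, k, v, key_padding_mask=text_pad_mask)
        # Residual fusion yields language-aware grid features.
        return self.norm(q + attended)


def soft_negative_targets(feats, temperature=0.1):
    # Intra-modal similarities as soft supervision: candidates highly similar
    # to one another receive correlated (soft) labels instead of hard zeros,
    # so likely false negatives are not pushed apart as strict negatives.
    f = F.normalize(feats, dim=-1)               # (N, D) unit-norm features
    sim = f @ f.t()                              # (N, N) cosine similarities
    return F.softmax(sim / temperature, dim=-1)  # rows sum to 1: soft labels
```

In this sketch the language-conditioned grid features would feed an anchor-based localization head, and the soft label matrix would replace one-hot targets in the alignment loss; both are plausible readings of the abstract, not confirmed details of the method.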

Related Material


[pdf]
[bibtex]
@InProceedings{Chu_2024_ACCV,
    author    = {Chu, Ting-Yu and Lin, Yong-Xiang and Huang, Ching-Chun and Hua, Kai-Lung},
    title     = {Enhancing Anchor-based Weakly Supervised Referring Expression Comprehension with Cross-Modality Attention},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {2767-2783}
}