Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions

Zeyu Han, Fangrui Zhu, Qianru Lao, Huaizu Jiang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14364-14374

Abstract


Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts, which requires: (i) a fine-grained disentanglement of the complex visual scene and textual context, and (ii) a capacity to understand relationships among the disentangled entities. Unfortunately, existing large vision-language alignment (VLA) models, e.g., CLIP, struggle with both aspects and so cannot be directly used for this task. To mitigate this gap, we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject, predicate, object). Grounding is then accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model and subsequently propagating it to an instance-level similarity matrix. Furthermore, to equip VLA models with the ability to understand relationships, we design a triplet-matching objective to fine-tune the VLA models on a curated collection of datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset, our zero-shot approach achieves accuracy comparable to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.
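
The grounding step described above can be illustrated with a minimal sketch. It is not the released implementation: the toy 2-D embeddings stand in for VLA (e.g. CLIP) features, and the function names, the `subject_box` field, the averaging of component similarities, and the max-then-sum propagation rule are all assumptions made for illustration. The sketch builds a triplet-level similarity matrix and then propagates it to per-box (instance-level) scores.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triplet_similarity(visual, textual):
    """Assumed rule: average the subject, predicate, and object similarities."""
    parts = ("subject", "predicate", "object")
    return sum(cosine(visual[p], textual[p]) for p in parts) / len(parts)

def ground(visual_triplets, textual_triplets, num_boxes):
    """Return (best_box, box_scores) for hypothetical triplet inputs."""
    # Structural similarity matrix: rows are visual triplets, columns textual ones.
    sim = [[triplet_similarity(v, t) for t in textual_triplets]
           for v in visual_triplets]
    # Propagate to an instance-level matrix (boxes x textual triplets):
    # each box keeps the best match among visual triplets it is the subject of.
    instance = [[0.0] * len(textual_triplets) for _ in range(num_boxes)]
    for v, row in zip(visual_triplets, sim):
        b = v["subject_box"]
        instance[b] = [max(old, new) for old, new in zip(instance[b], row)]
    # Assumed aggregation: sum each box's scores over all textual triplets.
    box_scores = [sum(row) for row in instance]
    best_box = max(range(num_boxes), key=box_scores.__getitem__)
    return best_box, box_scores

# Toy stand-ins for embeddings of one textual triplet and two visual triplets.
textual = [{"subject": [1, 0], "predicate": [0, 1], "object": [1, 1]}]
visual = [
    {"subject_box": 0, "subject": [1, 0], "predicate": [0, 1], "object": [1, 1]},
    {"subject_box": 1, "subject": [0, 1], "predicate": [1, 0], "object": [1, 1]},
]
best_box, box_scores = ground(visual, textual, num_boxes=2)
print(best_box, box_scores)
```

In this toy case box 0 matches the textual triplet on all three components, so it receives the highest score; box 1 matches only on the object.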

Related Material


[bibtex]
@InProceedings{Han_2024_CVPR,
    author    = {Han, Zeyu and Zhu, Fangrui and Lao, Qianru and Jiang, Huaizu},
    title     = {Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14364-14374}
}