Cross-modal Feature Alignment and Fusion for Composed Image Retrieval

Yongquan Wan, Wenhai Wang, Guobing Zou, Bofeng Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 8384-8388

Abstract


Composed Image Retrieval (CIR) lets users express search intent through a hybrid-modality query: a reference image paired with text that modifies certain attributes of that image. Owing to the inherent gap between images and text, CIR faces two main challenges: cross-modal alignment and feature fusion. To address these issues, we decompose the CIR task into a two-stage process and propose the cross-modal feature alignment and fusion model (CAFF). In the first stage, we fine-tune CLIP's encoders on domain-specific tasks to learn fine-grained domain knowledge for image retrieval. In the second stage, we enhance the pre-trained model for CIR. Our model incorporates the Image-Guided Global Fusion (IGGF), Text-Guided Global Fusion (TGGF), and Adaptive Combiner (AC) modules. IGGF and TGGF integrate complementary information through intra-modal and inter-modal interactions, discerning the alterations that distinguish the query image from the target image. The AC module balances the contributions of the two fused streams to yield the final compositional representation. Extensive experiments on three benchmark datasets demonstrate our model's superiority over state-of-the-art models.
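The abstract names the modules but gives no implementation details, so the following is only a minimal PyTorch sketch of how such a pipeline could be wired, not the authors' code. IGGF and TGGF are rendered here as generic cross-attention fusion blocks and AC as a sigmoid-gated convex combination; the feature dimension, pooling, layer choices, and gating form are all our assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """One guided-fusion block: a query token stream attends over a guiding
    stream via multi-head cross-attention, followed by a feed-forward layer
    with residual connections. Stands in for both IGGF and TGGF (assumed form)."""
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, query_tokens: torch.Tensor, guide_tokens: torch.Tensor) -> torch.Tensor:
        # Inter-modal interaction: tokens of one modality attend to the other.
        attended, _ = self.attn(query_tokens, guide_tokens, guide_tokens)
        x = self.norm1(query_tokens + attended)
        return self.norm2(x + self.ffn(x))

class AdaptiveCombiner(nn.Module):
    """Learns a per-sample gate that balances the image-guided and
    text-guided streams into one compositional query embedding."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1), nn.Sigmoid()
        )

    def forward(self, img_guided: torch.Tensor, txt_guided: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([img_guided, txt_guided], dim=-1))  # (B, 1)
        combined = w * img_guided + (1.0 - w) * txt_guided
        # L2-normalize so the query embedding is comparable to target-image
        # embeddings via cosine similarity.
        return F.normalize(combined, dim=-1)

# Hypothetical usage with CLIP-sized features (batch B, N tokens, dim 512);
# in practice the token sequences would come from the fine-tuned CLIP encoders.
B, N, D = 4, 16, 512
img_tokens, txt_tokens = torch.randn(B, N, D), torch.randn(B, N, D)
iggf, tggf, ac = CrossAttentionFusion(D), CrossAttentionFusion(D), AdaptiveCombiner(D)
img_guided = iggf(txt_tokens, img_tokens).mean(dim=1)  # image guides the text stream
txt_guided = tggf(img_tokens, txt_tokens).mean(dim=1)  # text guides the image stream
query_emb = ac(img_guided, txt_guided)                 # scored against target embeddings

Retrieval then reduces to nearest-neighbor search: query_emb is compared against CLIP embeddings of candidate target images, which is consistent with the abstract's description of a "final compositional representation" used for image retrieval.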

Related Material


[bibtex]
@InProceedings{Wan_2024_CVPR,
    author    = {Wan, Yongquan and Wang, Wenhai and Zou, Guobing and Zhang, Bofeng},
    title     = {Cross-modal Feature Alignment and Fusion for Composed Image Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {8384-8388}
}