Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features

Alberto Baldrati, Marco Bertini, Tiberio Uricchio, Alberto Del Bimbo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 4959-4968

Abstract


In this paper, we present an approach for conditioned and composed image retrieval based on CLIP features. In this extension of content-based image retrieval (CBIR) an image is combined with a text that provides information regarding user intentions, and is relevant for application domains like e-commerce. The proposed method is based on an initial training stage where a simple combination of visual and textual features is used, to fine-tune the CLIP text encoder. Then in a second training stage we learn a more complex combiner network that merges visual and textual features. Contrastive learning is used in both stages. The proposed approach obtains state-of-the-art performance for conditioned CBIR on the FashionIQ dataset and for composed CBIR on the more recent CIRR dataset.

Related Material


[pdf]
[bibtex]
@InProceedings{Baldrati_2022_CVPR, author = {Baldrati, Alberto and Bertini, Marco and Uricchio, Tiberio and Del Bimbo, Alberto}, title = {Conditioned and Composed Image Retrieval Combining and Partially Fine-Tuning CLIP-Based Features}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops}, month = {June}, year = {2022}, pages = {4959-4968} }