Dual-path Multimodal Optimal Transport for Composed Image Retrieval

@InProceedings{Yan_2024_ACCV,
  author    = {Yan, Cairong and Ma, Meng and Zhang, Yanting and Wan, Yongquan},
  title     = {Dual-path Multimodal Optimal Transport for Composed Image Retrieval},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {1741-1755}
}
Abstract
Unlike cross-modal retrieval tasks such as text-to-image and image-to-text retrieval, which focus on one-way feature alignment, composed image retrieval emphasizes bidirectional feature alignment to differentiate between features that need to be preserved and those that need to be modified. Existing methods usually map text and image data directly into a shared space for fusion, overlooking the mismatch between the feature distributions of the source and target domains, which limits retrieval performance. This paper presents the Dual-path Multimodal Optimal Transport (DMOT) model for composed image retrieval. It aligns features independently along both the text-to-image and image-to-text paths and, during fusion, explicitly computes the preserved and modified features. Specifically, the pre-trained vision-language model BLIP is used to extract deep semantic features of both images and text. Two optimal transport modules then iteratively optimize and solve mapping matrices that align the reference image features and the modification text features in their respective spaces. Finally, reflecting the characteristics of the composed image retrieval task, we design a feature modifier module and a feature preserver module to fuse the multimodal features. Extensive experiments on two public datasets, FashionIQ and CIRR, demonstrate DMOT's superior retrieval accuracy over state-of-the-art methods, with average improvements of 6.81% and 16.09%, respectively. The source code is available at https://github.com/AnoAuth/DMOT.
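The abstract does not reproduce the exact optimal transport formulation used in DMOT. As a rough, hedged illustration of the kind of cross-modal alignment it describes, the sketch below computes an entropy-regularized transport plan with Sinkhorn iterations between two sets of token features (e.g., BLIP image tokens and modification text tokens) and uses a barycentric projection to map one modality into the other's space, once per path. All function names, shapes, and hyperparameters (cost choice, regularization, iteration count) are illustrative assumptions, not the authors' implementation.

```python
import torch


def sinkhorn_plan(cost, eps=0.05, n_iters=50):
    """Entropy-regularized OT plan via Sinkhorn iterations (illustrative).

    cost: (n, m) pairwise cost matrix between two token sets.
    Returns an (n, m) transport plan whose marginals are approximately uniform.
    """
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)   # uniform source marginal
    nu = torch.full((m,), 1.0 / m)   # uniform target marginal
    K = torch.exp(-cost / eps)       # Gibbs kernel
    u = torch.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.t() @ u)
        u = mu / (K @ v)
    return torch.diag(u) @ K @ torch.diag(v)


def ot_align(src_tokens, tgt_tokens):
    """Transport source tokens into the target token space via the OT plan.

    src_tokens: (n, d) features from one modality (e.g., reference image).
    tgt_tokens: (m, d) features from the other modality (e.g., modification text).
    """
    # Cosine-distance cost between L2-normalized token features (an assumption).
    src = torch.nn.functional.normalize(src_tokens, dim=-1)
    tgt = torch.nn.functional.normalize(tgt_tokens, dim=-1)
    cost = 1.0 - src @ tgt.t()
    plan = sinkhorn_plan(cost)
    # Barycentric projection: each source token becomes a weighted mix of
    # target tokens, i.e., it is mapped into the target feature space.
    plan = plan / plan.sum(dim=1, keepdim=True)
    return plan @ tgt_tokens


if __name__ == "__main__":
    img_tokens = torch.randn(197, 256)  # hypothetical BLIP image token features
    txt_tokens = torch.randn(32, 256)   # hypothetical BLIP text token features
    img_in_txt_space = ot_align(img_tokens, txt_tokens)  # image-to-text path
    txt_in_img_space = ot_align(txt_tokens, img_tokens)  # text-to-image path
    print(img_in_txt_space.shape, txt_in_img_space.shape)
```

In a dual-path setup of this kind, the two aligned outputs would then feed the fusion stage (the feature modifier and feature preserver modules in the paper); that stage is not sketched here.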