ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion

Zhizhang Hu, Xinliang Zhu, Son Tran, René Vidal, Arnab Dhua; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2772-2777

Abstract


Traditional image-to-image and text-to-image search struggles to comprehend complex user intent, particularly in fashion e-commerce, where users look for products similar to a reference image but with specified modifications. This paper introduces a novel approach, the Progressive Vision-Language Alignment and Multimodal Fusion model (ProVLA), which utilizes both image and text inputs. ProVLA applies Transformer-based vision and language models to generate multimodal embeddings. Our method combines a two-step learning process, a cross-attention-based fusion encoder for robust information fusion, and a momentum queue-based hard negative mining mechanism to handle noisy training data. Extensive evaluation on the Fashion200k and Shoes benchmark datasets demonstrates that our model outperforms existing state-of-the-art methods.
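As a rough illustration of the cross-attention-based fusion the abstract describes, the following is a minimal PyTorch sketch in which text token embeddings attend over image patch embeddings to produce a single multimodal embedding. The dimensions, the single-block design, the mean pooling, and all names here are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Hypothetical cross-attention fusion block (not the paper's exact design).

    Text tokens act as queries; image patches act as keys/values, so the
    modification text can select relevant regions of the reference image.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, text_tokens: torch.Tensor, image_patches: torch.Tensor) -> torch.Tensor:
        # Cross-attention: text queries attend over image keys/values.
        attn_out, _ = self.cross_attn(
            query=text_tokens, key=image_patches, value=image_patches
        )
        x = self.norm1(text_tokens + attn_out)        # residual + norm
        x = self.norm2(x + self.ffn(x))               # feed-forward + residual + norm
        return x.mean(dim=1)                          # pool to one embedding per example


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    text = torch.randn(2, 16, 256)    # (batch, text tokens, dim) -- assumed shapes
    image = torch.randn(2, 49, 256)   # (batch, image patches, dim)
    print(fusion(text, image).shape)  # torch.Size([2, 256])
```

Using the text as queries is one common design choice for compositional search, since it lets the requested modification steer attention over the reference image; the resulting fused embedding could then be scored against candidate product embeddings, with hard negatives drawn from a momentum queue as the abstract indicates.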

Related Material


[pdf]
[bibtex]
@InProceedings{Hu_2023_ICCV,
    author    = {Hu, Zhizhang and Zhu, Xinliang and Tran, Son and Vidal, Ren\'e and Dhua, Arnab},
    title     = {ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2023},
    pages     = {2772-2777}
}