ProVLA: Compositional Image Search with Progressive Vision-Language Alignment and Multimodal Fusion
Traditional image-to-image and text-to-image search struggles to comprehend complex user intentions, particularly in fashion e-commerce, where users look for products similar to a reference image but with specific modifications. This paper introduces the Progressive Vision-Language Alignment and Multimodal Fusion model (ProVLA), a novel approach that takes both image and text as input. ProVLA applies Transformer-based vision and language models to generate multimodal embeddings. Our method combines a two-step learning process, a cross-attention-based fusion encoder for robust information fusion, and a momentum-queue-based hard negative mining mechanism to handle noisy training data. Extensive evaluation on the Fashion200k and Shoes benchmark datasets demonstrates that our model outperforms existing state-of-the-art methods.
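To make the cross-attention-based fusion step concrete, the following is a minimal NumPy sketch of how text token embeddings can attend to image patch embeddings. All names, shapes, and weight matrices here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of cross-attention fusion: text tokens act as
# queries, image patches provide keys and values. Shapes are assumed.

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(text_emb, image_emb, Wq, Wk, Wv):
    """text_emb: (T, d), image_emb: (P, d); returns fused (T, d)."""
    q = text_emb @ Wq                                # queries from text
    k = image_emb @ Wk                               # keys from image
    v = image_emb @ Wv                               # values from image
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (T, P) weights
    return attn @ v                                  # image-enriched text tokens

rng = np.random.default_rng(0)
d = 8
text = rng.normal(size=(4, d))     # 4 text tokens
image = rng.normal(size=(16, d))   # 16 image patches
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused = cross_attention(text, image, Wq, Wk, Wv)
print(fused.shape)  # (4, 8)
```

In the full model, such a fusion encoder would typically stack several of these layers with residual connections and feed-forward blocks; this sketch shows only the attention mechanism itself.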