De-noised Vision-language Fusion Guided by Visual Cues for E-commerce Product Search

Zhizhang Hu, Shasha Li, Ming Du, Arnab Dhua, Douglas Gray; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1986-1996

Abstract


In e-commerce applications, vision-language multimodal transformer models play a pivotal role in product search. The key to successfully training a multimodal model lies in the alignment quality of the image-text pairs in the dataset. In practice, however, the data is often collected automatically with minimal manual intervention, so the alignment of image-text pairs is far from ideal. In e-commerce, this misalignment can stem from noisy and redundant non-visually descriptive text attributes in the product description. To address this, we introduce the MultiModal alignment-guided Learned Token Pruning (MM-LTP) method. MM-LTP employs token pruning, conventionally used for computational efficiency, to perform online text cleaning during multimodal model training. By enabling the model to discern and discard unimportant tokens, it is able to train with implicitly cleaned image-text pairs. We evaluate MM-LTP on a benchmark multimodal e-commerce dataset comprising over 710,000 unique Amazon products. Our evaluation hinges on visual search, a prevalent e-commerce feature. Through MM-LTP, we demonstrate that refining text tokens enhances the training of the paired image branch, which leads to significantly improved visual search performance.
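
The abstract describes MM-LTP as repurposing learned token pruning to drop non-visually descriptive text tokens during multimodal training. The snippet below is a minimal sketch of how such alignment-guided pruning could look; the class name, the use of cross-attention from image queries as the importance signal, and the soft/hard gating scheme are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LearnedTokenPruning(nn.Module):
    """Sketch of attention-guided pruning of text tokens (illustrative only).

    Token importance is read from the attention that image queries pay to each
    text token; tokens scoring below a learnable threshold are masked out before
    vision-language fusion. A soft sigmoid gate keeps the threshold trainable;
    at inference the mask is hard.
    """

    def __init__(self, init_threshold: float = 0.01, temperature: float = 0.1):
        super().__init__()
        self.threshold = nn.Parameter(torch.tensor(init_threshold))
        self.temperature = temperature

    def forward(self, text_tokens: torch.Tensor, cross_attn: torch.Tensor):
        # text_tokens: (batch, T, dim)        text token embeddings
        # cross_attn:  (batch, heads, Q, T)   attention from image queries to text tokens
        importance = cross_attn.mean(dim=(1, 2))  # (batch, T) per-token importance score

        if self.training:
            # Soft mask so the threshold receives gradients during training.
            keep = torch.sigmoid((importance - self.threshold) / self.temperature)
        else:
            # Hard mask: drop tokens below the learned threshold.
            keep = (importance >= self.threshold).float()

        # Zero out pruned tokens; downstream attention layers would also take
        # `keep` as a key-padding mask so pruned tokens are ignored.
        pruned_tokens = text_tokens * keep.unsqueeze(-1)
        return pruned_tokens, keep
```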

Related Material


[pdf]
[bibtex]
@InProceedings{Hu_2024_CVPR,
    author    = {Hu, Zhizhang and Li, Shasha and Du, Ming and Dhua, Arnab and Gray, Douglas},
    title     = {De-noised Vision-language Fusion Guided by Visual Cues for E-commerce Product Search},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1986-1996}
}