Leveraging Large Language Models for Multimodal Search

Oriol Barbany, Michael Huang, Xinliang Zhu, Arnab Dhua; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1201-1210

Abstract


Multimodal search has become increasingly important in providing users with a natural and effective way to express their search intentions. Images offer fine-grained details of the desired products, while text allows for easily incorporating search modifications. However, some existing multimodal search systems are unreliable and fail to address simple queries. The problem becomes harder with the large variability of natural language text queries, which may contain ambiguous, implicit, and irrelevant information. Addressing these issues may require systems with enhanced matching capabilities, reasoning abilities, and context-aware query parsing and rewriting. This paper introduces a novel multimodal search model that achieves a new performance milestone on the Fashion200K dataset. Additionally, we propose a novel search interface integrating Large Language Models (LLMs) to facilitate natural language interaction. This interface routes queries to search systems while conversationally engaging with users and considering previous searches. When coupled with our multimodal search model, it heralds a new era of shopping assistants capable of offering human-like interaction and enhancing the overall search experience.
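
For illustration only, below is a minimal Python sketch of the kind of LLM-based query routing the abstract describes: the LLM is prompted with the conversation history and the current query, and returns a routing decision plus a rewritten query. The function names (`call_llm`, `route_query`), the prompt format, and the tool names are assumptions made for this sketch and are not the authors' implementation.

```python
# Hypothetical sketch of an LLM-driven query router for multimodal search.
# `call_llm` is a placeholder for any chat-completion endpoint; the routing
# prompt, tool names, and search backends are illustrative assumptions.

import json
from dataclasses import dataclass, field


@dataclass
class SearchSession:
    """Keeps conversational context across successive user queries."""
    history: list = field(default_factory=list)

    def add_turn(self, user_query: str, routed_to: str) -> None:
        self.history.append({"query": user_query, "routed_to": routed_to})


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM call; echoes a fixed JSON routing decision."""
    # A real system would invoke an LLM service here.
    return json.dumps({"tool": "multimodal_search", "rewritten_query": prompt})


def route_query(session: SearchSession, user_query: str, image_attached: bool) -> dict:
    """Ask the LLM to pick a search backend and rewrite the query using prior turns."""
    prompt = (
        "Previous searches: " + json.dumps(session.history) + "\n"
        "Current query: " + user_query + "\n"
        "Image attached: " + str(image_attached) + "\n"
        'Reply with JSON: {"tool": "text_search"|"multimodal_search", '
        '"rewritten_query": "..."}'
    )
    decision = json.loads(call_llm(prompt))
    session.add_turn(user_query, decision["tool"])
    return decision


if __name__ == "__main__":
    session = SearchSession()
    print(route_query(session, "same dress but in red", image_attached=True))
```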

Related Material


@InProceedings{Barbany_2024_CVPR,
  author    = {Barbany, Oriol and Huang, Michael and Zhu, Xinliang and Dhua, Arnab},
  title     = {Leveraging Large Language Models for Multimodal Search},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {1201-1210}
}