IRR-LMM: Improving On-demand Retail Recommendation with Large Multi-Modal Models

Yihao Zhao, Nan Lai, Xiaoming Li, Xu Yan, Wenhao Deng, Hujiang Huang, Shuai Zhang, Wei Lin; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4186-4195

Abstract


Most previous recommendation systems rely primarily on item IDs, which often hinders accurate recommendation, especially in cold-start scenarios. We propose a workflow that integrates multi-modal information into large-scale industrial systems. Multi-modal features bring semantic information into the recommendation system and improve its performance. Our approach utilizes the BEiT-3 model to extract and fuse multi-modal representations. To train the representation-extraction model, we design a pre-training and fine-tuning framework based on MoCo. We implement a two-stage architecture in which multi-modal representations are first generated and then applied to recommendation tasks. We also design an evaluation system to ensure the quality of these representations before applying them to the ranking model. Furthermore, we leverage multi-modal features in user behavior sequences, categorized by action type, to improve prediction accuracy and recommendation performance. For each of our improvements, we conducted detailed experiments, and the results indicate that every improvement contributes to the model's performance.
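The MoCo-based training framework mentioned above can be illustrated with a minimal sketch: a momentum-updated key encoder, a queue of negative keys from past batches, and InfoNCE-style logits over one positive and many negatives. This is a toy NumPy illustration of the general MoCo recipe, not the paper's implementation; the linear "encoder" standing in for BEiT-3 and all names (`encode`, `momentum_update`, `info_nce_logits`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim_in, dim_out, queue_len, m, tau = 8, 4, 16, 0.999, 0.07

W_q = rng.normal(size=(dim_out, dim_in))        # query encoder weights
W_k = W_q.copy()                                # key encoder starts as a copy
queue = rng.normal(size=(queue_len, dim_out))   # negative keys from past batches
queue /= np.linalg.norm(queue, axis=1, keepdims=True)

def encode(W, x):
    # A linear map followed by L2 normalization stands in for the real encoder.
    z = W @ x
    return z / np.linalg.norm(z)

def momentum_update(W_k, W_q, m):
    # The key encoder tracks the query encoder via an exponential moving average.
    return m * W_k + (1.0 - m) * W_q

def info_nce_logits(q, k_pos, queue, tau):
    # One positive similarity followed by queue_len negatives, scaled by temperature.
    pos = q @ k_pos
    negs = queue @ q
    return np.concatenate([[pos], negs]) / tau

x = rng.normal(size=dim_in)                  # two augmented "views" of one item
x_aug = x + 0.01 * rng.normal(size=dim_in)

q = encode(W_q, x)
k = encode(W_k, x_aug)
logits = info_nce_logits(q, k, queue, tau)   # cross-entropy target is index 0
W_k = momentum_update(W_k, W_q, m)

# The positive pair should score highest among all candidates.
print(int(np.argmax(logits)))  # → 0
```

In the full recipe, the logits feed a cross-entropy loss whose target is the positive key at index 0, the queue is refreshed with the new key `k`, and only the query encoder receives gradients; the momentum update keeps the key encoder's representations slowly evolving and thus the queue consistent.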

Related Material


[pdf]
[bibtex]
@InProceedings{Zhao_2025_ICCV,
    author    = {Zhao, Yihao and Lai, Nan and Li, Xiaoming and Yan, Xu and Deng, Wenhao and Huang, Hujiang and Zhang, Shuai and Lin, Wei},
    title     = {IRR-LMM: Improving On-demand Retail Recommendation with Large Multi-Modal Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4186-4195}
}