Cross-view Semantic Alignment for Livestreaming Product Recognition

Wenjie Yang, Yiyi Chen, Yan Li, Yanhua Cheng, Xudong Liu, Quan Chen, Han Li; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 13404-13413

Abstract


Live commerce is the act of selling products online through livestreaming. Customers' diverse demands for online products introduce additional challenges to Livestreaming Product Recognition. Previous works either focus on fashion clothing data or are limited to single-modal input, and are thus inconsistent with real-world scenarios in which multimodal data from various categories are present. In this paper, we contribute LPR4M, a large-scale multimodal dataset that covers 34 categories, comprises 3 modalities (image, video, and text), and is 50 times larger than the largest publicly available dataset. In addition, LPR4M contains diverse videos and noisy modality pairs and exhibits a long-tailed distribution, resembling real-world problems. Moreover, we propose a cRoss-vIew semantiC alignmEnt (RICE) model that learns discriminative instance features from the two views (image and video) of a product via instance-level contrastive learning and cross-view patch-level feature propagation. A novel Patch Feature Reconstruction loss is proposed to penalize semantic misalignment between cross-view patches. Extensive ablation studies demonstrate the effectiveness of RICE and provide insights into the importance of dataset diversity and expressivity.
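The abstract names two training signals: an instance-level contrastive objective that pulls the image and video views of the same product together, and a patch-level term that penalizes misaligned cross-view patches. The following is a minimal NumPy sketch of that general idea, not the authors' implementation. It assumes the image and video encoders have already produced patch features, assumes mean pooling to form instance embeddings, and uses a softmax cross-attention reconstruction as a stand-in for the Patch Feature Reconstruction loss, whose exact form is not given in the abstract. The function names, the temperature of 0.07, and the pooling choice are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def instance_contrastive_loss(img_emb, vid_emb, temperature=0.07):
    """Symmetric InfoNCE (hypothetical stand-in for the instance-level term).

    The i-th image and the i-th video are treated as the positive pair;
    every other item in the batch acts as a negative.
    """
    img = l2_normalize(img_emb)                  # (B, D)
    vid = l2_normalize(vid_emb)                  # (B, D)
    logits = img @ vid.T / temperature           # (B, B) scaled cosine similarities
    idx = np.arange(logits.shape[0])
    loss_i2v = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> video
    loss_v2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # video -> image
    return 0.5 * (loss_i2v + loss_v2i)


def patch_reconstruction_loss(img_patches, vid_patches, temperature=0.07):
    """Hypothetical patch-level alignment term (assumption, not the paper's PFR).

    Each image patch is rebuilt as an attention-weighted sum of the paired
    video's patches; the mean squared error between the rebuilt and the
    original (normalized) image patches is returned.
    """
    q = l2_normalize(img_patches)                # (B, P, D)
    k = l2_normalize(vid_patches)                # (B, Q, D)
    scores = q @ k.transpose(0, 2, 1) / temperature       # (B, P, Q)
    attn = np.exp(log_softmax(scores, axis=-1))           # rows sum to 1
    recon = attn @ k                             # (B, P, D) propagated features
    return np.mean(np.sum((recon - q) ** 2, axis=-1))


# Usage with random stand-in features (no real encoder involved).
rng = np.random.default_rng(0)
B, P, Q, D = 4, 16, 32, 64                       # batch, image patches, video patches, dim
img_patches = rng.normal(size=(B, P, D))
vid_patches = rng.normal(size=(B, Q, D))
img_emb = img_patches.mean(axis=1)               # hypothetical mean pooling
vid_emb = vid_patches.mean(axis=1)
total = instance_contrastive_loss(img_emb, vid_emb) + patch_reconstruction_loss(
    img_patches, vid_patches
)
print(f"total loss: {total:.4f}")
```

The symmetric form averages the image-to-video and video-to-image directions so that neither view dominates; the reconstruction term is low only when every image patch has a semantically close counterpart among the video patches, which is one plausible way to express "cross-view patch misalignment" as a penalty.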

Related Material


@InProceedings{Yang_2023_ICCV,
    author    = {Yang, Wenjie and Chen, Yiyi and Li, Yan and Cheng, Yanhua and Liu, Xudong and Chen, Quan and Li, Han},
    title     = {Cross-view Semantic Alignment for Livestreaming Product Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {13404-13413}
}