BibTeX:
@InProceedings{Trabelsi_2025_WACV,
  author    = {Trabelsi, Ameni and Zontak, Maria and Qian, Yiming and Jackson, Brian and Khan, Suleiman and Batur, Umit},
  title     = {What Matters when Building Vision Language Models for Product Image Analysis?},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
  month     = {February},
  year      = {2025},
  pages     = {1372-1381}
}
What Matters when Building Vision Language Models for Product Image Analysis?
Abstract
This paper investigates multi-modal large language models (MLLMs) for predicting product features from images, comparing fine-tuned models against proprietary ones. We introduce two domain-specific benchmarks: (1) the Inductive Bias vs. Image Evidence (IBIE) benchmark, which evaluates MLLMs' ability to distinguish between image-derived features and latent knowledge, and (2) Catalog-bench, which assesses feature prediction using Catalog terminology. Our fine-tuned model outperforms proprietary models such as Gemini by 9.4% and 29.13% on these benchmarks, respectively. We also address the crucial aspect of computational efficiency, exploring cost-effective deployment solutions under limited hardware resources. The significance of this work extends beyond e-commerce to physical retail, where efficient MLLMs are essential for real-time processing of visual data from store cameras and shelf sensors. These models enable automated inventory management, produce quality monitoring, and planogram compliance while operating within in-store computing constraints. This capability is particularly valuable for physical retail environments, where immediate decisions about restocking and quality control are critical, while also enabling real-time assistance to customers seeking information about product details, ingredients, and nutritional content.