@InProceedings{Hansen_2025_ICCV,
  author    = {Hansen, Ryan and Setia, Hardik and Hamilton-Fletcher, Giles and Jain, Aryan and Liu, Zirui and Zoair, Mariam and Aboutaleb, Reem and Wen, Qing and Li, Yu and Rizzo, John Ross},
  title     = {Visual Language Model-based Food Safety Support for Persons with Blindness and Low Vision},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {2524-2533}
}
Visual Language Model-based Food Safety Support for Persons with Blindness and Low Vision
Abstract
Persons with blindness and low vision (pBLV) often lack access to visual indicators of food spoilage, which increases their risk of foodborne illness. To determine whether Visual Language Models (VLMs) could assist pBLV with food safety tasks, we evaluated four advanced VLMs -- ChatGPT-4 Vision, LLaVA-OneVision, Gemini 1.5 Flash, and VisPercep. The VLMs were tasked with answering 'Is this safe to eat?' for each image in (1) a fresh vs. rotten food task (120 images, 6 food types) and (2) a time-lapse sequence task of food spoiling (352 images, 16 food types). Here we report 'true positives' ("safe" + fresh), 'true negatives' ("unsafe" + rotten), 'false negatives' ("unsafe" + fresh) and, crucially, 'false positives' (FP, "safe" + rotten), which are dangerous for the user. For task 1, VisPercep had the highest overall accuracy at 87.50% and always provided a definitive yes/no answer, but also had 13 false positives (21.67%). By contrast, ChatGPT-4 Vision was the safest, with an accuracy of 84.17% and only 1 false positive (1.67%), but gave no definitive answer on 17/120 trials. Gemini had 84.17% accuracy, 12 FP, and 4 no-choice responses, while LLaVA had the lowest performance, with 46.67% accuracy, 16 FP, and 44 no-choice responses. For the time-lapse data, we compared the VLMs against the cutoff point for being 'safe to eat' according to majority rule among 5 human participants. Here we found that Gemini had the highest accuracy at 80.97% with 20 FP, while ChatGPT-4 Vision was safer, with 71.02% accuracy and 14 FP. VisPercep had 70.74% accuracy with 96 FP, and LLaVA had 60.80% accuracy with 127 FP. Overall, we show large variations in the performance profiles of these VLMs and highlight key issues that will need to be addressed to improve their food safety judgments for pBLV in the future.
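The outcome categories in the abstract can be sketched as a simple confusion-matrix tally. This is a minimal illustration, not the authors' evaluation code: the `tally` function name, the answer strings, and the choice to compute accuracy over all trials (including no-choice responses) are assumptions for the sketch.

```python
from collections import Counter

def tally(predictions, labels):
    """Tally safety-judgment outcomes for the 'Is this safe to eat?' task.

    predictions: model answers, each 'safe', 'unsafe', or 'no choice'
    labels: ground truth for each image, 'fresh' or 'rotten'
    """
    counts = Counter()
    for pred, truth in zip(predictions, labels):
        if pred == "no choice":
            counts["no_choice"] += 1       # model declined to give a yes/no answer
        elif pred == "safe" and truth == "fresh":
            counts["true_positive"] += 1
        elif pred == "unsafe" and truth == "rotten":
            counts["true_negative"] += 1
        elif pred == "unsafe" and truth == "fresh":
            counts["false_negative"] += 1  # overly cautious, but not hazardous
        else:  # 'safe' + rotten
            counts["false_positive"] += 1  # the dangerous case for the user
    correct = counts["true_positive"] + counts["true_negative"]
    accuracy = correct / len(labels)       # all trials count toward accuracy
    return counts, accuracy
```

Separating false positives from overall accuracy, as above, is what lets a model like ChatGPT-4 Vision score lower on raw accuracy than VisPercep yet still be the safer choice.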
