Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach


Pitikorn Khlaisamniang, Kun Kerdthaisong, Supasate Vorathammathorn, Nutchanon Yongsatianchot, Hirunkul Phimsiri, Amrest Chinkamol, Teermade Thitseesaeng, Kanyakorn Veerakanjana, Kaisorn Kachai, Piyalitt Ittichaiwong, Tossaporn Saengja; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 482-491

Abstract

Accurate estimation of nutritional information from food images remains a challenging problem. Most existing approaches either rely on deep image models fine-tuned with extensive food annotations or require detailed user inputs (e.g., portion size, cooking method), both of which are prone to error. Motivated by the workflow of nutrition experts, we propose a two-step prompting framework that leverages off-the-shelf Multimodal Large Language Models (MLLMs). The first step deconstructs the dish into its components, listing major ingredients, portion sizes, and cooking details, while the second step computes total calories and macronutrients. This approach alleviates the need for heavy fine-tuning or large ingredient databases by instead harnessing the compositional reasoning capabilities of general MLLMs. We evaluate the method on both a subset of the Nutrition5k dataset (Nutrition320) and real-world samples from the Gindee application (Gindee121), achieving more accurate estimates than one-step direct queries. Additional experiments with visual prompts (bounding boxes, segmentation masks) further demonstrate the robustness and adaptability of our approach. Notably, our findings reveal that guiding MLLMs through a structured two-step reasoning process, separating "what is on the plate" from "how it translates nutritionally", substantially improves the reliability of image-based macronutrient estimation.
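The two-step flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the prompt wording, the `MLLM` callable interface, the JSON output keys, and the choice to pass the image again in the second step are all assumptions made for illustration, and `fake_model` is a hypothetical stand-in that returns canned placeholder text rather than real model output or measured nutrition data.

```python
import json
from typing import Callable, Dict

# Hypothetical interface (an assumption): any function that takes a text
# prompt plus raw image bytes and returns the model's text reply.
MLLM = Callable[[str, bytes], str]

# Illustrative prompt wording; the paper's actual prompts are not shown here.
STEP1_PROMPT = (
    "You are a nutritionist. List the major ingredients visible in this dish, "
    "with an estimated portion size in grams and the cooking method for each."
)
STEP2_PROMPT = (
    "Given the dish components below, estimate the totals for the whole dish. "
    'Reply with JSON only, using the keys "calories_kcal", "protein_g", '
    '"carbs_g", "fat_g".\n\nComponents:\n'
)

NUTRIENT_KEYS = ("calories_kcal", "protein_g", "carbs_g", "fat_g")


def estimate_nutrition(image: bytes, model: MLLM) -> Dict[str, float]:
    """Run the two-step query: decompose the dish, then compute totals."""
    # Step 1: "what is on the plate" (ingredients, portions, cooking details).
    components = model(STEP1_PROMPT, image)
    # Step 2: "how it translates nutritionally", conditioned on step 1's text.
    reply = model(STEP2_PROMPT + components, image)
    totals = json.loads(reply)
    return {key: float(totals[key]) for key in NUTRIENT_KEYS}


def fake_model(prompt: str, image: bytes) -> str:
    """Stand-in for a real MLLM call; returns canned placeholder replies."""
    if prompt == STEP1_PROMPT:
        return "- chicken breast, 150 g, grilled\n- white rice, 200 g, steamed"
    return '{"calories_kcal": 500, "protein_g": 50, "carbs_g": 56, "fat_g": 6}'


if __name__ == "__main__":
    print(estimate_nutrition(b"", fake_model))
```

Swapping `fake_model` for a function that wraps an actual multimodal model call would leave `estimate_nutrition` unchanged. A one-step baseline, by contrast, would send a single prompt asking for the totals directly.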

Related Material


@InProceedings{Khlaisamniang_2025_CVPR,
    author    = {Khlaisamniang, Pitikorn and Kerdthaisong, Kun and Vorathammathorn, Supasate and Yongsatianchot, Nutchanon and Phimsiri, Hirunkul and Chinkamol, Amrest and Thitseesaeng, Teermade and Veerakanjana, Kanyakorn and Kachai, Kaisorn and Ittichaiwong, Piyalitt and Saengja, Tossaporn},
    title     = {Decomposing Food Images for Better Nutrition Analysis: A Nutritionist-Inspired Two-Step Multimodal LLM Approach},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {482-491}
}