Retrieval Augmented Recipe Generation

Liu, Guoshan; Yin, Hailong; Zhu, Bin; Chen, Jingjing; Ngo, Chong-Wah; Jiang, Yu-Gang

Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2453-2463

Abstract

Generating recipes from food images has drawn substantial research attention in recent years. Existing works on recipe generation primarily use a two-stage training method: first predicting ingredients from a food image, and then generating instructions from both the image and the ingredients. Large Multi-modal Models (LMMs), which have achieved notable success across a variety of vision and language tasks, shed light on generating both ingredients and instructions directly from images. Nevertheless, LMMs still face the common issue of hallucinations during recipe generation, leading to suboptimal performance. To tackle this issue, we propose a retrieval augmented large multimodal model for recipe generation. We first introduce Stochastic Diversified Retrieval Augmentation (SDRA) to retrieve recipes semantically related to the image from an existing datastore as a supplement, integrating them into the prompt to add diverse and rich context to the input image. Additionally, a Self-Consistency Ensemble Voting mechanism is proposed to determine the most confident predicted recipe as the final output. It calculates the consistency among generated recipe candidates, which use different retrieved recipes as context for generation. Extensive experiments validate the effectiveness of our proposed method, which demonstrates state-of-the-art (SOTA) performance in recipe generation on the Recipe1M dataset.
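The abstract does not say how retrieved recipes are sampled or how consistency among candidates is scored, so the following is only a minimal sketch of the two ideas under stated assumptions. The function names (`sample_contexts`, `jaccard`, `vote`), the uniform random sampling, and the token-overlap similarity are all hypothetical stand-ins, not the authors' implementation.

```python
import random
from typing import Callable, List, Sequence


def sample_contexts(retrieved: Sequence[str], k: int, n_prompts: int,
                    seed: int = 0) -> List[List[str]]:
    """Draw n_prompts random subsets of size k from the retrieved recipes.

    Assumption: uniform sampling without replacement within each subset.
    Requires k <= len(retrieved).
    """
    rng = random.Random(seed)
    pool = list(retrieved)
    return [rng.sample(pool, k) for _ in range(n_prompts)]


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; an assumed stand-in for the consistency score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def vote(candidates: Sequence[str],
         sim: Callable[[str, str], float] = jaccard) -> str:
    """Return the candidate with the highest mean similarity to the others."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(candidates) == 1:
        return candidates[0]

    def agreement(i: int) -> float:
        # Average similarity of candidate i to every other candidate.
        others = [sim(candidates[i], c)
                  for j, c in enumerate(candidates) if j != i]
        return sum(others) / len(others)

    best = max(range(len(candidates)), key=agreement)
    return candidates[best]
```

In this sketch, each subset returned by `sample_contexts` would be placed in the prompt for one generation pass, and `vote` would then pick the generated recipe that agrees most with the rest. Any text similarity function can be passed as `sim` in place of the token overlap used here.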

Related Material



@InProceedings{Liu_2025_WACV,
    author    = {Liu, Guoshan and Yin, Hailong and Zhu, Bin and Chen, Jingjing and Ngo, Chong-Wah and Jiang, Yu-Gang},
    title     = {Retrieval Augmented Recipe Generation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {2453-2463}
}