MM-IFEngine: Towards Multimodal Instruction Following

Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, Jiaqi Wang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 1099-1109

Abstract


The Instruction Following (IF) ability measures how well Multi-modal Large Language Models (MLLMs) understand exactly what users are telling them and doing it right.Existing multimodal instruction following training data is scarce, the benchmarks are simple with atomic instructions, and the evaluation strategies are imprecise for tasks demanding exact output constraints.To address this, we present MM-IFEngine, an effective pipeline to generate high-quality image-instruction pairs.Our MM-IFEngine pipeline yields large-scale, diverse, and high-quality training data MM-IFInstruct-23k, which is suitable for Supervised Fine-Tuning (SFT) and extended as MM-IFDPO-23k for Direct Preference Optimization (DPO).We further introduce MM-IFEval, a challenging and diverse multi-modal instruction-following benchmark that includes (1) both textual constraints for output responses and visual constraints tied to the input images, and (2) a comprehensive evaluation pipeline incorporating rule-based assessment and LLM-as-a-Judge evaluation.We conduct SFT and DPO experiments and demonstrate that fine-tuning MLLMs on MM-IFInstruct-23k and MM-IFDPO-23k achieve notable gains on various IF benchmarks, such as MM-IFEval (+11.8%), MIA (+7.7%), and IFEval (+10.5%).

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Ding_2025_ICCV, author = {Ding, Shengyuan and Wu, Shenxi and Zhao, Xiangyu and Zang, Yuhang and Duan, Haodong and Dong, Xiaoyi and Zhang, Pan and Cao, Yuhang and Lin, Dahua and Wang, Jiaqi}, title = {MM-IFEngine: Towards Multimodal Instruction Following}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {1099-1109} }