Instruction-based Image Manipulation by Watching How Things Move

Mingdeng Cao, Xuaner Zhang, Yinqiang Zheng, Zhihao Xia; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 2704-2713

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses multimodal large language models (MLLMs) to generate editing instructions for training instruction-based image manipulation models. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. Additionally, video data captures diverse, natural dynamics, such as non-rigid subject motion and complex camera movements, that are difficult to model otherwise, making it an ideal source for scalable dataset construction. Using this approach, we create a new dataset to train InstructMove, a model capable of instruction-based complex manipulations that are difficult to achieve with synthetically generated datasets. Our model demonstrates state-of-the-art performance in tasks such as adjusting subject poses, rearranging elements, and altering camera perspectives.
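To make the dataset pipeline concrete, here is a minimal sketch of the two steps the abstract describes: sampling a pair of frames from a video, then asking an MLLM to phrase the difference between them as an editing instruction. It assumes OpenCV for frame decoding; `query_mllm`, its signature, the prompt wording, and the `max_gap` parameter are hypothetical placeholders, not the authors' implementation.

```python
# Sketch of frame-pair sampling + MLLM instruction generation
# (illustrative only; `query_mllm` is a hypothetical placeholder).

import random
import cv2  # pip install opencv-python


def sample_frame_pair(video_path: str, max_gap: int = 60):
    """Sample two frames from one video, at most `max_gap` frames apart."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    start = random.randint(0, max(0, total - max_gap - 1))
    end = start + random.randint(1, max_gap)
    frames = []
    for idx in (start, end):
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame = cap.read()
        if not ok:  # short or unreadable video: skip it
            cap.release()
            return None
        frames.append(frame)
    cap.release()
    return frames[0], frames[1]


def build_training_triplet(video_path: str, query_mllm):
    """Return a (source_image, target_image, instruction) triplet, or None."""
    pair = sample_frame_pair(video_path)
    if pair is None:
        return None
    src, tgt = pair
    # The MLLM sees both frames and describes the change as an edit
    # instruction, e.g. "turn the dog's head toward the camera".
    instruction = query_mllm(
        images=[src, tgt],
        prompt="Describe how to edit the first image to obtain the second, "
               "as a concise editing instruction.",
    )
    return src, tgt, instruction
```

Because both frames come from the same video, subject and scene identity are preserved by construction, which is the property the pipeline relies on; the MLLM only needs to describe the natural motion between the two frames as an edit.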

Related Material

[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Cao_2025_CVPR,
    author    = {Cao, Mingdeng and Zhang, Xuaner and Zheng, Yinqiang and Xia, Zhihao},
    title     = {Instruction-based Image Manipulation by Watching How Things Move},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {2704-2713}
}