Towards Language-Driven Video Inpainting via Multimodal Large Language Models

Wu, Jianzong; Li, Xiangtai; Si, Chenyang; Zhou, Shangchen; Yang, Jingkang; Zhang, Jiangning; Li, Yining; Chen, Kai; Tong, Yunhai; Liu, Ziwei; Loy, Chen Change

Jianzong Wu, Xiangtai Li, Chenyang Si, Shangchen Zhou, Jingkang Yang, Jiangning Zhang, Yining Li, Kai Chen, Yunhai Tong, Ziwei Liu, Chen Change Loy; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 12501-12511

Abstract

We introduce a new task -- language-driven video inpainting which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset containing 5650 videos and 9091 inpainting results to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework the first end-to-end baseline for this task integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We have made datasets code and models publicly available at https://github.com/jianzongwu/Language-Driven-Video-Inpainting.

Related Material

[pdf] [supp] [arXiv]

[bibtex]

@InProceedings{Wu_2024_CVPR, author = {Wu, Jianzong and Li, Xiangtai and Si, Chenyang and Zhou, Shangchen and Yang, Jingkang and Zhang, Jiangning and Li, Yining and Chen, Kai and Tong, Yunhai and Liu, Ziwei and Loy, Chen Change}, title = {Towards Language-Driven Video Inpainting via Multimodal Large Language Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {12501-12511} }