I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting

Nicola Fanelli, Gennaro Vessio, Giovanna Castellano; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6073-6082

Abstract


Inpainting focuses on filling missing or corrupted regions of an image so that they blend seamlessly with the surrounding content and style. While conditional diffusion models have proven effective for text-guided inpainting, we introduce the novel task of multi-mask inpainting, where multiple regions are simultaneously inpainted using distinct prompts. Furthermore, we design a fine-tuning procedure for multimodal LLMs, such as LLaVA, to generate multi-mask prompts automatically using corrupted images as inputs. These models can generate helpful and detailed prompt suggestions for filling the masked regions. The generated prompts are then fed to Stable Diffusion, which is fine-tuned for the multi-mask inpainting problem using rectified cross-attention, enforcing prompts onto their designated regions for filling. Experiments on digitized paintings from WikiArt and the Densely Captioned Images dataset demonstrate that our pipeline delivers creative and accurate inpainting results. Our code, data, and trained models are available at https://cilabuniba.github.io/i-dream-my-painting.
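The core of rectified cross-attention is restricting each prompt's text tokens so they can only influence the image patches of their designated mask. The paper does not spell out the exact implementation here, so the following is a minimal NumPy sketch under assumed shapes: attention logits of shape (patches, tokens), a region index per token and per patch, and a fallback where patches belonging to no prompt's region keep unrestricted attention. The function name and interface are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rectified_cross_attention(scores, region_of_token, region_of_patch):
    """Sketch of region-restricted cross-attention (assumed formulation).

    scores: (num_patches, num_tokens) raw attention logits
    region_of_token: (num_tokens,) index of the mask each prompt token belongs to
    region_of_patch: (num_patches,) index of the mask each image patch lies in
    """
    # allowed[p, t] is True iff token t's prompt targets patch p's region
    allowed = region_of_patch[:, None] == region_of_token[None, :]
    neg = np.finfo(scores.dtype).min
    masked = np.where(allowed, scores, neg)
    # Patches whose region matches no token (e.g. uncorrupted background)
    # fall back to unrestricted attention -- an assumption of this sketch.
    uncovered = ~allowed.any(axis=1)
    masked[uncovered] = scores[uncovered]
    return softmax(masked, axis=1)

# Two patches in regions 0 and 1; two prompts of two tokens each.
scores = np.zeros((2, 4))
attn = rectified_cross_attention(
    scores, np.array([0, 0, 1, 1]), np.array([0, 1])
)
```

With uniform (zero) logits, each patch attends only to its own prompt's tokens, e.g. the first patch yields weights [0.5, 0.5, 0, 0]. In the actual model this masking would be applied inside every cross-attention layer of the fine-tuned Stable Diffusion U-Net.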

Related Material


@InProceedings{Fanelli_2025_WACV, author = {Fanelli, Nicola and Vessio, Gennaro and Castellano, Giovanna}, title = {I Dream My Painting: Connecting MLLMs and Diffusion Models via Prompt Generation for Text-Guided Multi-Mask Inpainting}, booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)}, month = {February}, year = {2025}, pages = {6073-6082} }