@InProceedings{Li_2025_CVPR,
  author    = {Li, Mingcheng and Hou, Xiaolu and Liu, Ziyang and Yang, Dingkang and Qian, Ziyun and Chen, Jiawei and Wei, Jinjie and Jiang, Yue and Xu, Qingyao and Zhang, Lihua},
  title     = {MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {13263-13272}
}
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
Abstract
Diffusion models have shown excellent performance in text-to-image generation. However, existing methods often hit performance bottlenecks on complex prompts involving multiple objects, characteristics, and relations. We therefore propose Multi-Agent Collaboration-based Compositional Diffusion (MCCD) for text-to-image generation of complex scenes. Specifically, we design a multi-agent collaboration-based scene parsing module that uses MLLMs to build an agent system of multiple agents with distinct tasks, adequately extracting the various scene elements. In addition, hierarchical compositional diffusion applies Gaussian masking and filtering to refine bounding-box regions and highlights objects through region enhancement, enabling accurate, high-fidelity generation of complex scenes. Comprehensive experiments demonstrate that MCCD significantly improves the performance of baseline models in a training-free manner, giving it a clear advantage in complex scene generation. The code will be open-sourced on GitHub.
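The Gaussian masking of bounding-box regions mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gaussian_box_mask`, the `sigma_scale` parameter, and the exact mask form are assumptions for illustration only.

```python
import numpy as np

def gaussian_box_mask(h, w, box, sigma_scale=0.25):
    """Soft spatial mask for a bounding-box region (illustrative sketch).

    box = (x0, y0, x1, y1) in pixel coordinates. The mask peaks at the
    box centre and decays with a Gaussian whose spread is proportional
    to the box size; sigma_scale is a hypothetical tuning parameter.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) ** 2) / (2 * sx ** 2)
                    + ((ys - cy) ** 2) / (2 * sy ** 2)))

# Example: a soft mask emphasising one object's region in a 64x64 latent.
mask = gaussian_box_mask(64, 64, (16, 16, 48, 48))
```

Such a mask could weight a region's contribution during denoising, so each object's box is emphasised smoothly rather than with a hard cutoff.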