MarkovGen: Structured Prediction for Efficient Text-to-Image Generation

Sadeep Jayasumana, Daniel Glasner, Srikumar Ramalingam, Andreas Veit, Ayan Chakrabarti, Sanjiv Kumar; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9316-9325

Abstract


Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt but also compatible with each other. In this work we propose a light-weight approach to achieving this compatibility between different regions of an image using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model MarkovGen uses this proposed MRF model to both speed up Muse by 1.5xand produce higher quality images by decreasing undesirable image artifacts.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Jayasumana_2024_CVPR, author = {Jayasumana, Sadeep and Glasner, Daniel and Ramalingam, Srikumar and Veit, Andreas and Chakrabarti, Ayan and Kumar, Sanjiv}, title = {MarkovGen: Structured Prediction for Efficient Text-to-Image Generation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {9316-9325} }