CIGLI: Conditional Image Generation From Language & Image

Lu, Xiaopeng; Ng, Lynnette; Fernandez, Jared; Zhu, Hao

Xiaopeng Lu, Lynnette Ng, Jared Fernandez, Hao Zhu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 3134-3138

Abstract

Multi-modal generation has been widely explored in recent years. Current research directions involve generating text based on an image or vice versa. In this paper, we propose a new task called CIGLI: Conditional Image Generation from Language and Image. Instead of generating an image from text alone, as in text-to-image generation, this task requires generating an image from a textual description and an image prompt. We design a new dataset to ensure that the text description describes information from both images, and that analyzing the description alone is insufficient to generate an image. We then propose a novel language-image fusion model that improves performance over two established baseline methods, as measured by both quantitative (automatic) and qualitative (human) evaluations. The code and dataset are available at https://github.com/vincentlux/CIGLI.

Related Material

@InProceedings{Lu_2021_ICCV,
    author    = {Lu, Xiaopeng and Ng, Lynnette and Fernandez, Jared and Zhu, Hao},
    title     = {CIGLI: Conditional Image Generation From Language \& Image},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2021},
    pages     = {3134-3138}
}