Gated cross word-visual attention-driven generative adversarial networks for text-to-image synthesis
The main objective of text-to-image (Txt2Img) synthesis is to generate realistic images from text descriptions. We propose inserting a gated cross word-visual attention unit (GCAU) into the conventional multi-stage generative adversarial network Txt2Img framework. Our GCAU consists of two key components. First, a cross word-visual attention mechanism is proposed to draw fine-grained details in different subregions of the image by focusing on the relevant words (via visual-to-word attention), and to select important words by attending to the relevant synthesized subregions of the image (via word-to-visual attention). Second, a gated refinement mechanism is proposed to dynamically select important word information for refining the generated image. Extensive experiments demonstrate the superior image generation performance of the proposed approach on the CUB and MS-COCO benchmark datasets.
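The two GCAU components described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the shared embedding size, the scaled dot-product attention form, and the gate parameterization (a sigmoid over concatenated visual and word-context features, with hypothetical weights `U`) are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 8           # shared word/visual embedding size (assumed)
T, N = 5, 16    # T words in the description, N image subregions

W = rng.normal(size=(T, d))   # word features from a text encoder (stand-in)
V = rng.normal(size=(N, d))   # visual subregion features from the generator (stand-in)

# Visual-to-word attention: each image subregion attends over all words,
# producing a word-context vector that carries the relevant textual details.
A_vw = softmax(V @ W.T / np.sqrt(d), axis=1)   # (N, T), rows sum to 1
word_ctx = A_vw @ W                            # (N, d) word context per subregion

# Word-to-visual attention: each word attends over all subregions, giving
# a per-word relevance signal with respect to the synthesized image.
A_wv = softmax(W @ V.T / np.sqrt(d), axis=1)   # (T, N), rows sum to 1

# Gated refinement: a sigmoid gate decides, per subregion and channel, how
# much word context to inject when refining the visual feature.
U = rng.normal(size=(2 * d, d)) * 0.1          # hypothetical gate weights
gate = 1.0 / (1.0 + np.exp(-np.concatenate([V, word_ctx], axis=1) @ U))
V_refined = gate * word_ctx + (1.0 - gate) * V # (N, d) refined visual features
```

The gate interpolates between the original visual feature and the word context, so subregions whose text evidence is weak stay close to their current appearance while strongly text-relevant subregions are pushed toward the word context.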