@InProceedings{Luo_2025_ICCV,
  author    = {Luo, Minxing and Xia, Zixun and Chen, Liaojun and Li, Zhenhang and Zeng, Weichao and Wang, Jianye and Cheng, Wentao and Wang, Yaxing and Zhou, Yu and Yang, Jian},
  title     = {Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {1937-1946}
}
Beyond Flat Text: Dual Self-inherited Guidance for Visual Text Generation
Abstract
In real-world images, slanted or curved text, especially on cans, banners, or badges, appears at least as frequently as flat text due to artistic design or layout constraints. While diffusion models now enable high-quality visual text generation, they often produce distorted text and inharmonious text backgrounds when given slanted or curved text layouts, owing to limitations of their training data. In this paper, we propose a new framework, STGen, which accurately generates visual text in challenging scenarios (e.g., slanted or curved text layouts) while harmonizing it with the background. Our framework decomposes the visual text generation process into two branches: (i) a Semantic Rectification Branch, which leverages the model's ability to generate flat but accurate visual text to guide generation in challenging scenarios. The latent of the generated flat text is rich in accurate semantic information about both the text itself and its background; by incorporating it, we rectify the semantics of the text and harmonize its integration with the background in complex layouts. (ii) a Structure Injection Branch, which reinforces the visual text structure during inference. We incorporate the latent of the glyph image, rich in glyph structure, as an additional condition to further strengthen the text structure. To enhance image harmony, we also apply an effective combination method to merge these priors, providing a solid foundation for generation. Extensive experiments across a variety of visual text layouts demonstrate that our framework achieves superior accuracy and outstanding quality.
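The two branches described above can be illustrated with a minimal sketch of a single denoising step. All names, weights, and the blending scheme here are hypothetical, chosen only to show how a flat-text latent and a glyph-image latent might be combined as guidance; the paper's actual mechanism may differ.

```python
# Hypothetical sketch of dual-branch guidance at one denoising step.
# The function names, blending weights, and list-of-floats "latents"
# are illustrative assumptions, not the paper's implementation.

def dual_guidance_step(z_t, z_flat, z_glyph, denoise, alpha=0.5, beta=0.3):
    """Blend the current latent z_t with a flat-text latent (Semantic
    Rectification Branch) and a glyph-image latent (Structure Injection
    Branch), then run one denoising step.

    z_t     -- current noisy latent (list of floats, stand-in for a tensor)
    z_flat  -- latent from the flat-text generation, rich in text/background
               semantics
    z_glyph -- latent of the rendered glyph image, rich in glyph structure
    denoise -- one step of the diffusion model's denoiser
    """
    # Semantic Rectification: pull the latent toward the flat-text latent,
    # which carries accurate semantics for the text and its background.
    z = [(1 - alpha) * a + alpha * f for a, f in zip(z_t, z_flat)]
    # Structure Injection: add glyph-structure information as an extra
    # conditioning signal.
    z = [a + beta * g for a, g in zip(z, z_glyph)]
    return denoise(z)
```

With an identity denoiser, `dual_guidance_step([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], lambda z: z)` returns `[0.8, 0.8]`: each element is first averaged toward the flat-text latent, then nudged by the glyph latent.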