-
[pdf]
[supp]
[arXiv]
[bibtex]@InProceedings{Ye_2025_ICCV, author = {Ye, Xingsong and Du, Yongkun and Tao, Yunbo and Chen, Zhineng}, title = {TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {17464-17473} }
TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition
Abstract
Scene text recognition (STR) suffers from challenges of either less realistic synthetic training data or the difficulty of collecting sufficient high-quality real-world data, limiting the effectiveness of trained models. Meanwhile, despite producing holistically appealing text images, diffusion-based visual text generation methods struggle to synthesize accurate and realistic instance-level text at scale. To tackle this, we introduce TextSSR: a novel pipeline for Synthesizing Scene Text Recognition training data. TextSSR targets three key synthesizing characteristics: accuracy, realism, and scalability. It achieves accuracy through a proposed region-centric text generation with position-glyph enhancement, ensuring proper character placement. It maintains realism by guiding style and appearance generation using contextual hints from surrounding text or background. This character-aware diffusion architecture enjoys precise character-level control and semantic coherence preservation, without relying on natural language prompts. Therefore, TextSSR supports large-scale generation through combinatorial text permutations. Based on these, we present TextSSR-F, a dataset of 3.55 million quality-screened text instances. Extensive experiments show that STR models trained on TextSSR-F outperform those trained on existing synthetic datasets by clear margins on common benchmarks, and further improvements are observed when mixed with real-world training data. Code is available at https://github.com/YesianRohn/TextSSR.
Related Material
