CLIPTER: Looking at the Bigger Picture in Scene Text Recognition

Aviad Aberdam, David Bensaid, Alona Golts, Roy Ganz, Oren Nuriel, Royee Tichauer, Shai Mazor, Ron Litman; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21706-21717

Abstract


Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes.
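The fusion step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: it assumes a single attention head, randomly initialised projection matrices, and a scalar tanh gate initialised at zero, which is one common way to make a fusion layer start as an identity mapping and shift gradually toward the context-enhanced output. The class name `GatedCrossAttention`, the parameter `alpha`, and all dimensions are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class GatedCrossAttention:
    """Single-head gated cross-attention (illustrative sketch only).

    Word-level features from the crop-based recognizer act as queries;
    scene-level tokens from a vision-language model act as keys and values.
    """

    def __init__(self, d_word, d_scene, d_attn, seed=0):
        rng = np.random.default_rng(seed)
        scale = 0.02  # assumed small random initialisation
        self.w_q = rng.normal(0.0, scale, (d_word, d_attn))
        self.w_k = rng.normal(0.0, scale, (d_scene, d_attn))
        self.w_v = rng.normal(0.0, scale, (d_scene, d_word))
        # Assumed gate parameter: tanh(0) = 0, so the layer initially
        # returns the recognizer features unchanged.
        self.alpha = 0.0

    def __call__(self, word_feats, scene_feats):
        q = word_feats @ self.w_q    # (T, d_attn)
        k = scene_feats @ self.w_k   # (N, d_attn)
        v = scene_feats @ self.w_v   # (N, d_word)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, N)
        context = attn @ v           # (T, d_word)
        # Gated residual: the scene context is mixed in only as the
        # gate moves away from zero.
        return word_feats + np.tanh(self.alpha) * context


# Usage with made-up shapes: 5 word-level tokens, 10 scene-level tokens.
rng = np.random.default_rng(1)
word_feats = rng.normal(size=(5, 8))
scene_feats = rng.normal(size=(10, 16))
layer = GatedCrossAttention(d_word=8, d_scene=16, d_attn=4)
fused = layer(word_feats, scene_feats)
```

With the gate at zero the output equals the input word-level features, so a pretrained recognizer would behave exactly as before at the start of fine-tuning. As `alpha` moves away from zero during training, the scene-level context contributes more to the fused representation.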

Related Material


@InProceedings{Aberdam_2023_ICCV,
    author    = {Aberdam, Aviad and Bensaid, David and Golts, Alona and Ganz, Roy and Nuriel, Oren and Tichauer, Royee and Mazor, Shai and Litman, Ron},
    title     = {CLIPTER: Looking at the Bigger Picture in Scene Text Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21706-21717}
}