OTE: Exploring Accurate Scene Text Recognition Using One Token

Jianjun Xu, Yuxin Wang, Hongtao Xie, Yongdong Zhang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 28327-28336

Abstract


In this paper we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition. Based on this insight we introduce a new paradigm for STR called One Token rEcognizer (OTE). Specifically we implement an image-to-vector encoder to extract the fine-grained global semantics eliminating the need for sequential features. Furthermore an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations enabling both autoregressive and non-autoregressive decoding schemes. By executing decoding within a high-level representational space our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably due to introducing character-wise fine-grained information such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Our code is available at https://github.com/Xu-Jianjun/OTE.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Xu_2024_CVPR, author = {Xu, Jianjun and Wang, Yuxin and Xie, Hongtao and Zhang, Yongdong}, title = {OTE: Exploring Accurate Scene Text Recognition Using One Token}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {28327-28336} }