Textual Alchemy: CoFormer for Scene Text Understanding
The paper presents CoFormer (Convolutional Fourier Transformer), a robust and adaptable transformer architecture designed for a range of scene text tasks. CoFormer integrates convolution and Fourier operations into the transformer architecture. It thereby leverages convolution properties such as shared weights, local receptive fields, and spatial subsampling, while the Fourier operation emphasizes composite characteristics from the frequency domain. The paper further proposes the first pretraining datasets, named Textverse10M-E and Textverse10M-H, and uses them to demonstrate the efficacy of pretraining for scene text understanding. CoFormer achieves state-of-the-art results with and without pretraining on two downstream tasks: scene text recognition and scene text style transfer. The paper also presents LISTNet (Language Invariant Style Transfer), a novel framework for bilingual scene text style transfer, and introduces three datasets: TST500K for scene text style transfer, and CSTR2.5M and Akshara550 for scene text recognition.
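To make the idea of combining convolution with a Fourier operation concrete, the following is a minimal NumPy sketch of one possible token-mixing block. It is not the CoFormer architecture, whose layer design the abstract does not specify. The function names (`depthwise_conv1d`, `fourier_mix`, `conv_fourier_block`), the depthwise kernel, the FNet-style use of the real part of a 2D FFT, and the residual plus layer-norm arrangement are all assumptions made for illustration.

```python
import numpy as np

def depthwise_conv1d(x, kernel):
    """Local mixing along the sequence axis (hypothetical helper).

    x: (seq_len, dim) token features; kernel: (k, dim) with k odd.
    The same kernel is applied at every position (shared weights),
    and each output sees only k neighbours (local receptive field).
    Zero padding keeps the sequence length unchanged.
    """
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i in range(k):
        out += xp[i:i + x.shape[0]] * kernel[i]
    return out

def fourier_mix(x):
    """Global mixing in the frequency domain (assumed, FNet-style).

    Takes the real part of a 2D FFT over the sequence and feature axes.
    """
    return np.fft.fft2(x).real

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def conv_fourier_block(x, kernel):
    """One illustrative block: local conv mixing, then Fourier mixing.

    Each sub-layer uses a residual connection followed by layer norm.
    This ordering is an assumption, not taken from the paper.
    """
    x = layer_norm(x + depthwise_conv1d(x, kernel))
    x = layer_norm(x + fourier_mix(x))
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((8, 4))   # 8 tokens, 4 features each
    kernel = rng.standard_normal((3, 4))   # kernel size 3, one filter per feature
    print(conv_fourier_block(tokens, kernel).shape)
```

In this sketch the convolution supplies the local, weight-shared mixing the abstract attributes to convolution, while the FFT step mixes information across all tokens at once. A full model would add learned projections, feed-forward layers, and spatial subsampling, none of which are shown here.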