Textual Alchemy: CoFormer for Scene Text Understanding

Gayatri Deshmukh, Onkar Susladkar, Dhruv Makwana, Sparsh Mittal, Sai Chandra Teja R.; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 2931-2941


The paper presents CoFormer (Convolutional Fourier Transformer), a robust and adaptable transformer architecture designed for a range of scene text tasks. CoFormer integrates convolution and Fourier operations into the transformer architecture. It thereby leverages convolution properties such as shared weights, local receptive fields, and spatial subsampling, while the Fourier operation emphasizes composite characteristics from the frequency domain. The paper further proposes the first pretraining datasets, named Textverse10M-E and Textverse10M-H, and uses them to demonstrate the efficacy of pretraining for scene text understanding. CoFormer achieves state-of-the-art results with and without pretraining on two downstream tasks: scene text recognition and scene text style transfer. The paper also presents LISTNet (Language Invariant Style Transfer), a novel framework for bilingual scene text style transfer, and introduces three datasets: TST500K for scene text style transfer, and CSTR2.5M and Akshara550 for scene text recognition.
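The abstract does not specify how CoFormer combines its two operations, so the following is only a toy sketch of the general idea, not the authors' architecture. It assumes a 1D sequence, a single shared kernel, a stride of 2, a naive DFT whose real part is kept, and a residual sum; the function names (conv1d, dft_real, conv_fourier_block) are hypothetical and chosen for illustration.

```python
import cmath


def conv1d(seq, kernel, stride=1):
    """Valid 1D cross-correlation with one shared kernel.

    Each output sees only len(kernel) neighbouring inputs (local receptive
    field); the same kernel is reused at every position (shared weights);
    stride > 1 skips positions (spatial subsampling).
    """
    k = len(kernel)
    return [
        sum(kernel[j] * seq[i + j] for j in range(k))
        for i in range(0, len(seq) - k + 1, stride)
    ]


def dft_real(seq):
    """Real part of the discrete Fourier transform (naive O(n^2) version).

    Every output mixes all inputs, so this acts as a global,
    frequency-domain view of the sequence.
    """
    n = len(seq)
    return [
        sum(
            seq[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n)
        ).real
        for f in range(n)
    ]


def conv_fourier_block(seq, kernel, stride=2):
    """Hypothetical block: local convolution, then Fourier mixing, then a
    residual sum. This composition is an assumption made for illustration."""
    local = conv1d(seq, kernel, stride)  # local, weight-shared, subsampled
    mixed = dft_real(local)              # global mixing in the frequency domain
    return [a + b for a, b in zip(local, mixed)]


if __name__ == "__main__":
    features = [1, 2, 3, 4, 5, 6, 7, 8]
    print(conv_fourier_block(features, kernel=[0.5, 0.5], stride=2))
```

With a stride of 2 and a kernel of length 2, the eight input values are reduced to four before the Fourier step, which shows the subsampling property in miniature. A real model would operate on 2D feature maps with learned kernels and a fast FFT routine.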

Related Material

@InProceedings{Deshmukh_2024_WACV,
    author    = {Deshmukh, Gayatri and Susladkar, Onkar and Makwana, Dhruv and Mittal, Sparsh and R., Sai Chandra Teja},
    title     = {Textual Alchemy: CoFormer for Scene Text Understanding},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {2931-2941}
}