CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR
Abstract
Scene Text Retrieval (STR) seeks to identify all images containing a given query string. Existing methods typically rely on an explicit Optical Character Recognition (OCR) process of text spotting or localization, which leads to complex pipelines and accumulated errors. To address this, we resort to Contrastive Language-Image Pre-training (CLIP) models, which have demonstrated the capacity to perceive and understand scene text, making strictly OCR-free STR possible. From the perspective of parameter-efficient transfer learning, we propose a lightweight visual position adapter that complements the positional information of the visual encoder. Besides, we introduce a visual context dropout technique to improve the alignment of local visual features. A novel, parameter-free cross-attention mechanism transfers the contrastive relationship between images and text to that between visual tokens and text, producing a rich cross-modal representation that can be used for efficient reranking with a linear classifier. The resulting model, CAYN, shows that CLIP is Almost all You Need for STR: with no more than 0.50M additional parameters, it achieves new state-of-the-art performance on the STR task, with 92.46%/89.49%/85.98% mAP on the SVT/IIIT-STR/TTR datasets. Our findings demonstrate that CLIP can serve as a reliable and efficient solution for OCR-free STR.
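The abstract does not spell out implementation details, but the parameter-free cross-attention it names can be sketched roughly as follows. This is a minimal illustration under assumed shapes, and all names here (parameter_free_cross_attention, text_emb, visual_tokens, and the temperature value) are hypothetical rather than the authors' code: CLIP's contrastively aligned embeddings are reused directly as queries and keys, so the token-to-text attention introduces no learnable weights.

# A minimal sketch in PyTorch, assuming CLIP-style embeddings that
# already live in a shared image-text space. Names and the temperature
# are illustrative assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def parameter_free_cross_attention(text_emb: torch.Tensor,
                                   visual_tokens: torch.Tensor,
                                   temperature: float = 0.07) -> torch.Tensor:
    """text_emb: (d,) CLIP text embedding of the query string.
    visual_tokens: (N, d) visual tokens projected into CLIP's shared
    image-text space. Returns a (d,) text-conditioned visual feature
    computed without any learnable parameters."""
    # Cosine-normalize both sides, mirroring CLIP's contrastive similarity.
    q = F.normalize(text_emb, dim=-1)        # (d,)
    k = F.normalize(visual_tokens, dim=-1)   # (N, d)
    # Token-to-text similarities act as attention logits, so the
    # image-text contrastive relationship is reused at the token level.
    attn = torch.softmax(k @ q / temperature, dim=0)  # (N,)
    # Aggregate tokens into a cross-modal representation that a small
    # linear classifier could then score for reranking.
    return attn @ visual_tokens              # (d,)

Because the queries and keys come straight from CLIP's jointly trained embedding space, no projection matrices need to be learned, which is consistent with the sub-0.50M additional-parameter budget the abstract reports.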
Related Material
[pdf] [supp] [bibtex]
@InProceedings{Qin_2025_CVPR,
    author    = {Qin, Xugong and Zhang, Peng and Yang, Jun Jie Ou and Zeng, Gangyan and Li, Yubo and Wang, Yuanyuan and Zhang, Wanqian and Dai, Pengwen},
    title     = {CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {24873-24883}
}