Sequential Transformer for End-to-End Video Text Detection

Jun-Bo Zhang, Meng-Biao Zhao, Fei Yin, Cheng-Lin Liu; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 6520-6530

Abstract


In existing video text detection methods, the detection and tracking branches are usually independent of each other; although they jointly optimize the backbone network, the tracking-by-detection paradigm is still required during inference. To address this issue, we propose a novel video text detection framework based on a sequential transformer, which decodes the detection and tracking tasks in parallel without explicitly setting up a tracking branch. To achieve this, we first introduce the concept of the instance query, which learns long-term context information in the video sequence. Then, based on the instance query, the transformer decoder predicts the entire box and mask sequence of each text instance in one pass, so the tracking task is realized naturally. In addition, the proposed method can be applied to the scene text detection task seamlessly, without modifying any modules. To the best of our knowledge, this is the first framework to unify the tasks of scene text detection and video text detection. Our model achieves state-of-the-art performance on four video text datasets (YVT, RT-1K, BOVText, and BiRViT-1K), and competitive results on three scene text datasets (CTW1500, MSRA-TD500, and Total-Text). The code is available at https://github.com/zjb-1/SeqVideoText.
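
The sketch below illustrates the decoding idea the abstract describes: a set of instance queries attends to features from all frames of a clip, and each query emits a per-frame box and mask sequence in a single decoder pass, so frame-to-frame association comes for free. This is a minimal illustration assuming a PyTorch-style implementation; the module names, query count, and prediction heads are assumptions for exposition, not the authors' released code.

```python
# Minimal sketch (not the authors' implementation) of decoding box and mask
# sequences for every text instance in one pass, without a tracking branch.
import torch
import torch.nn as nn

class SequentialTextDecoder(nn.Module):
    def __init__(self, num_queries=100, d_model=256, num_frames=8, mask_size=28):
        super().__init__()
        # Instance queries: each query stands for one text instance across the whole clip.
        self.instance_queries = nn.Embedding(num_queries, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        self.num_frames = num_frames
        self.mask_size = mask_size
        # Per-frame heads shared by every query (illustrative design).
        self.box_head = nn.Linear(d_model, num_frames * 4)              # (cx, cy, w, h) per frame
        self.mask_head = nn.Linear(d_model, num_frames * mask_size**2)  # coarse mask logits per frame
        self.cls_head = nn.Linear(d_model, 2)                           # text / no-text

    def forward(self, video_memory):
        # video_memory: (B, T*H*W, d_model) backbone features of all frames flattened together,
        # so each instance query can attend to long-term context over the whole sequence.
        B = video_memory.size(0)
        queries = self.instance_queries.weight.unsqueeze(0).expand(B, -1, -1)
        hs = self.decoder(queries, video_memory)                         # (B, Q, d_model)
        boxes = self.box_head(hs).view(B, -1, self.num_frames, 4).sigmoid()
        masks = self.mask_head(hs).view(B, -1, self.num_frames, self.mask_size, self.mask_size)
        scores = self.cls_head(hs)
        # Tracking falls out naturally: the t-th box/mask of query q is the same
        # instance in frame t, so no separate association step is needed.
        return scores, boxes, masks

# Usage: a clip of 8 frames reduced to 32x32 feature maps with 256 channels.
memory = torch.randn(1, 8 * 32 * 32, 256)
scores, boxes, masks = SequentialTextDecoder()(memory)
print(boxes.shape, masks.shape)  # (1, 100, 8, 4) and (1, 100, 8, 28, 28)
```

With a single frame as input (T = 1), the same decoder reduces to ordinary scene text detection, which is consistent with the claim that the framework applies to both tasks without modifying any modules.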

Related Material


[pdf]
[bibtex]
@InProceedings{Zhang_2024_WACV,
    author    = {Zhang, Jun-Bo and Zhao, Meng-Biao and Yin, Fei and Liu, Cheng-Lin},
    title     = {Sequential Transformer for End-to-End Video Text Detection},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {6520-6530}
}