Transformer-Based Text Detection in the Wild
A major limitation of most state-of-the-art visual localization methods is their inability to exploit the ubiquitous signs and directions that are typically intuitive to humans. Localization methods can greatly benefit from a system capable of reasoning about a variety of cues beyond low-level features, such as street signs, store names, building directories, and room numbers. In this work, we tackle the problem of text detection in the wild, an essential step towards achieving text-based localization and mapping. While current state-of-the-art text detection methods employ ad hoc solutions with complex multi-stage components, we propose a Transformer-based architecture inherently capable of dealing with multi-oriented text in images. A central contribution of our work is the introduction of a loss function tailored to the rotated text detection problem, leveraging a rotated version of the generalized intersection over union (GIoU) score to properly capture rotated text regions. We evaluate our proposed model qualitatively and quantitatively on several challenging datasets, namely ICDAR15, ICDAR17, and MSRA-TD500, and show that it outperforms current state-of-the-art methods in text detection in the wild.
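To make the rotated-GIoU idea concrete, the following is a minimal, self-contained Python sketch, not the paper's implementation: each rotated box is treated as a convex quadrilateral in counter-clockwise corner order, the intersection area is computed by Sutherland-Hodgman polygon clipping, and the enclosing region C in GIoU = IoU - |C \ (A ∪ B)| / |C| is taken to be the convex hull of both boxes (an assumption for illustration; an enclosing rotated rectangle is another common choice). The loss is 1 - GIoU, and all function names here are ours.

```python
def _poly_area(poly):
    """Absolute area of a simple polygon via the shoelace formula."""
    n = len(poly)
    s = 0.0
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0

def _clip(subject, clip_poly):
    """Sutherland-Hodgman clipping of `subject` against a convex CCW `clip_poly`."""
    def is_inside(p, a, b):
        # Left-of-edge test for a CCW clip polygon.
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0.0
    def line_hit(p1, p2, a, b):
        # Intersection of segment p1-p2 with the infinite line through a-b.
        dx, dy = p2[0] - p1[0], p2[1] - p1[1]
        ex, ey = b[0] - a[0], b[1] - a[1]
        t = ((a[0] - p1[0]) * ey - (a[1] - p1[1]) * ex) / (dx * ey - dy * ex)
        return (p1[0] + t * dx, p1[1] + t * dy)
    out = list(subject)
    m = len(clip_poly)
    for i in range(m):
        a, b = clip_poly[i], clip_poly[(i + 1) % m]
        src, out = out, []
        for j in range(len(src)):
            p1, p2 = src[j], src[(j + 1) % len(src)]
            if is_inside(p2, a, b):
                if not is_inside(p1, a, b):
                    out.append(line_hit(p1, p2, a, b))
                out.append(p2)
            elif is_inside(p1, a, b):
                out.append(line_hit(p1, p2, a, b))
        if not out:  # boxes are disjoint
            return []
    return out

def _convex_hull(points):
    """Andrew's monotone chain; returns hull vertices in CCW order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def rotated_giou_loss(box_a, box_b):
    """1 - GIoU for two convex (possibly rotated) boxes given as CCW corner
    lists. The enclosing region C is the convex hull of both boxes -- an
    illustrative assumption, not necessarily the paper's exact formulation."""
    inter = _poly_area(_clip(box_a, box_b))
    union = _poly_area(box_a) + _poly_area(box_b) - inter
    if union <= 0.0:
        return 1.0
    hull_area = _poly_area(_convex_hull(list(box_a) + list(box_b)))
    giou = inter / union - (hull_area - union) / hull_area
    return 1.0 - giou
```

Unlike plain IoU, this loss still provides a gradient signal when predicted and ground-truth boxes do not overlap: the hull penalty term grows as the boxes move apart, which is what makes GIoU-style losses attractive for regressing rotated text regions.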