Looking from a Higher-level Perspective: Attention and Recognition Enhanced Multi-scale Scene Text Segmentation
Scene text segmentation, which aims to generate pixel-level text masks, is an integral part of many fine-grained text tasks such as text editing and text removal. Multi-scale, irregular scene texts are often embedded in complex background noise, and their textures are diverse and sometimes even similar to those of the background. These problems make general segmentation methods ineffective in the context of scene text. To tackle these issues, we propose a new scene text segmentation pipeline called the Attention and Recognition enhanced Multi-scale segmentation Network (ARM-Net), which consists of three main components: the Text Segmentation Module (TSM), which generates rectangular receptive fields of various sizes to fit scene text and adequately integrate global information; the Dual Perceptual Decoder (DPD), which strengthens the connections between pixels of the same category from the spatial and channel perspectives simultaneously during upsampling; and the Recognition Enhanced Module (REM), which provides text attention maps as a prior for the segmentation network, inherently distinguishing text from background noise. Through extensive experiments, we demonstrate the effectiveness of each module of ARM-Net and show that its performance surpasses that of existing state-of-the-art scene text segmentation methods. We also show that the pixel-level masks produced by our method further improve the performance of text removal and scene text recognition.
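To make the DPD idea of jointly gating pixels from the spatial and channel perspectives concrete, here is a minimal NumPy sketch. The function names, the sigmoid gating, and the additive fusion are illustrative assumptions for exposition only, not the paper's actual implementation (which operates inside a learned decoder during upsampling):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    # feat: (C, H, W); weight each channel by its global average response,
    # so channels that respond to the same (text) category are emphasized together
    gap = feat.mean(axis=(1, 2))                 # (C,)
    return feat * sigmoid(gap)[:, None, None]

def spatial_attention(feat):
    # feat: (C, H, W); weight each pixel by its mean activation across channels,
    # linking spatially distant pixels with similar responses
    smap = feat.mean(axis=0)                     # (H, W)
    return feat * sigmoid(smap)[None, :, :]

def dual_perception(feat):
    # fuse the two views simultaneously (additive fusion is an assumption here)
    return channel_attention(feat) + spatial_attention(feat)

feat = np.random.rand(4, 8, 8).astype(np.float32)
out = dual_perception(feat)
print(out.shape)  # (4, 8, 8)
```

In a real decoder these gates would be produced by learned convolutions and applied at each upsampling stage; the sketch only shows how channel-wise and pixel-wise reweighting can act on the same feature map at once.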