Bottom-Up Shift and Reasoning for Referring Image Segmentation
Referring image segmentation aims to segment the referent that is the corresponding object or stuff referred by a natural language expression in an image. Its main challenge lies in how to effectively and efficiently differentiate between the referent and other objects of the same category as the referent. In this paper, we tackle the challenge by jointly performing compositional visual reasoning and accurate segmentation in a single stage via the proposed novel Bottom-Up Shift (BUS) and Bidirectional Attentive Refinement (BIAR) modules. Specifically, BUS progressively locates the referent along hierarchical reasoning steps implied by the expression. At each step, it locates the corresponding visual region by disambiguating between similar regions, where the disambiguation bases on the relationships between regions. By the explainable visual reasoning, BUS explicitly aligns linguistic components with visual regions so that it can identify all the mentioned entities in the expression. BIAR fuses multi-level features via a two-way attentive message passing, which captures the visual details relevant to the referent to refine segmentation results. Experimental results demonstrate that the proposed method consisting of BUS and BIAR modules, can not only consistently surpass all existing state-of-the-art algorithms across common benchmark datasets but also visualize interpretable reasoning steps for stepwise segmentation. Code is available at https://github.com/incredibleXM/BUSNet.