-
[pdf]
[bibtex]@InProceedings{Ho_2024_ACCV, author = {Ho, Ngoc-Vuong and Phan, Thinh and Adkins, Meredith and Rainwater, Chase and Cothren, Jackson and Le, Ngan}, title = {RSSep: Sequence-to-Sequence Model for Simultaneous Referring Remote Sensing Segmentation and Detection}, booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops}, month = {December}, year = {2024}, pages = {218-231} }
RSSep: Sequence-to-Sequence Model for Simultaneous Referring Remote Sensing Segmentation and Detection
Abstract
Semantic segmentation in remote sensing images plays a crucial role in a wide range of geographic information applications. Despite the abundance of data, this field faces limitations due to the restricted set of categories and the inability of existing methods to accurately describe and localize individual or multiple objects within scenes. Addressing this challenge, the emerging fields of referring remote sensing image segmentation (RRSIS) and referring remote sensing object detection (RRSOD) have recently garnered attention. Both tasks, RRSIS and RRSOD, combine computer vision and natural language processing to localize objects based on a text query, with the outputs being segmentation masks and bounding boxes. Additionally, boundary information in remote sensing images, such as land-cover delineations, is crucial for segmentation tasks. To tackle this novel challenge, we introduce RSSep, a Sequenceto- Sequence model designed for simultaneous RRSIS and RRSOD. Unlike conventional approaches that use encoder-decoder blocks for pixellevel classification, our network leverages a sequence-to-sequence model to estimate polygonal boundaries, represented as sequences of vertices. Furthermore, we enhanced the network by improving the text encoder using both query and object noun features, employing the same architecture to extract these features. Our network is benchmarked on the recently introduced RRSIS-D dataset, notable for its extensive collection of image-caption-mask triplets across diverse scales and variations. Experimental results demonstrate the superiority of our method over existing techniques in both the RRSIS and RRSOD fields, underscoring its efficacy in semantic segmentation and object detection tasks in remote sensing imagery.
Related Material