A Simple Framework for Open-Vocabulary Segmentation and Detection

Zhang, Hao; Li, Feng; Zou, Xueyan; Liu, Shilong; Li, Chunyuan; Yang, Jianwei; Zhang, Lei

Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianwei Yang, Lei Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1020-1031

Abstract

In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pretrained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with counterparts trained on the segmentation task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy: segmentation requires extracting masks for both foreground objects and background stuff, while detection cares only about the former; ii) data discrepancy: box and mask annotations have different spatial granularity, and thus are not directly interchangeable. We propose a decoupled foreground/background decoding and a conditioned mask decoding to address these issues, respectively. To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pretraining, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA on panoptic segmentation on COCO and ADE20K, and on instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeeD is the first to explore the potential of joint training on segmentation and detection, and we hope it can serve as a strong baseline for developing a single model for open-vocabulary segmentation and detection.
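The idea of a common semantic space for the two vocabularies can be illustrated with a minimal sketch. This is not the authors' implementation: the random vectors stand in for the output of a pretrained text encoder, and the names `classify_queries`, `temperature`, and the toy vocabularies are assumptions made only for illustration. The sketch shows how a visual query embedding could be scored against text embeddings of class names drawn from both a segmentation and a detection vocabulary.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_queries(query_emb, text_emb, temperature=0.07):
    """Score each visual query against every class-name embedding.

    query_emb: (num_queries, dim) visual embeddings (hypothetical decoder output).
    text_emb:  (num_classes, dim) embeddings of class names.
    Returns a (num_queries, num_classes) matrix of softmax probabilities.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(text_emb)
    logits = q @ t.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)


# Toy vocabularies: one from a segmentation dataset, one from a detection dataset.
seg_vocab = ["person", "dog", "sky"]
det_vocab = ["person", "bicycle"]
vocab = sorted(set(seg_vocab) | set(det_vocab))  # merged label space

rng = np.random.default_rng(0)
# Assumption: random vectors stand in for a pretrained text encoder's outputs.
text_emb = rng.normal(size=(len(vocab), 16))

# A visual query that lies close to the "dog" text embedding.
query_emb = text_emb[[vocab.index("dog")]] + 0.01 * rng.normal(size=(1, 16))

probs = classify_queries(query_emb, text_emb)
predicted = vocab[int(probs.argmax())]
```

Because classes are matched by text embedding rather than by a fixed index, concepts from either dataset share one label space, and new class names can be scored at test time by embedding them the same way.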

Related Material


@InProceedings{Zhang_2023_ICCV,
    author    = {Zhang, Hao and Li, Feng and Zou, Xueyan and Liu, Shilong and Li, Chunyuan and Yang, Jianwei and Zhang, Lei},
    title     = {A Simple Framework for Open-Vocabulary Segmentation and Detection},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {1020-1031}
}