Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection

Heng Zhang, Qiuyu Zhao, Linyu Zheng, Hao Zeng, Zhiwei Ge, Tianhao Li, Sulong Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 16975-16984

Abstract


Open-vocabulary object detection aims to detect novel categories that are independent from the base categories used during training. Most modern methods adhere to the paradigm of learning vision-language space from a large-scale multi-modal corpus and subsequently transferring the acquired knowledge to off-the-shelf detectors like Faster-RCNN. However information attenuation or destruction may occur during the process of knowledge transfer due to the domain gap hampering the generalization ability on novel categories. To mitigate this predicament in this paper we present a novel framework named BIND standing for Bulit-IN Detector to eliminate the need for module replacement or knowledge transfer to off-the-shelf detectors. Specifically we design a two-stage training framework with an Encoder-Decoder structure. In the first stage an image-text dual encoder is trained to learn region-word alignment from a corpus of image-text pairs. In the second stage a DETR-style decoder is trained to perform detection on annotated object detection datasets. In contrast to conventional manually designed non-adaptive anchors which generate numerous redundant proposals we develop an anchor proposal network that generates anchor proposals with high likelihood based on candidates adaptively thereby substantially improving detection efficiency. Experimental results on two public benchmarks COCO and LVIS demonstrate that our method stands as a state-of-the-art approach for open-vocabulary object detection.

Related Material


[pdf]
[bibtex]
@InProceedings{Zhang_2024_CVPR, author = {Zhang, Heng and Zhao, Qiuyu and Zheng, Linyu and Zeng, Hao and Ge, Zhiwei and Li, Tianhao and Xu, Sulong}, title = {Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {16975-16984} }