@InProceedings{Zhao_2024_CVPR,
    author    = {Zhao, Shiyu and Schulter, Samuel and Zhao, Long and Zhang, Zhixing and G, Vijay Kumar B and Suh, Yumin and Chandraker, Manmohan and Metaxas, Dimitris N.},
    title     = {Taming Self-Training for Open-Vocabulary Object Detection},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13938-13947}
}
Taming Self-Training for Open-Vocabulary Object Detection
Abstract
Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However, teacher-student self-training, a powerful and widely used paradigm to leverage PLs, is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges, we propose SAS-Det, which tames self-training for OVD from two key perspectives. First, we present a split-and-fusion (SAF) head that splits a standard detection head into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover, the two branches learn complementary knowledge from different training data, significantly enhancing performance when fused together. Second, in our view, unlike in closed-set tasks, the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher, thereby decreasing the frequency of changes in PL distributions, which stabilizes the training process. Extensive experiments demonstrate that SAS-Det is both efficient and effective. SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks, respectively. Code is available at https://github.com/xiaofeng94/SAS-Det.
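The periodic update idea can be illustrated with a minimal sketch. This is not the SAS-Det implementation: it assumes the teacher is simply replaced by a copy of the student every `update_period` steps, and the names `train_with_periodic_teacher`, `update_period`, and `train_step` are hypothetical placeholders chosen for illustration.

```python
import copy


def train_with_periodic_teacher(student, steps, update_period, train_step):
    """Toy self-training loop: the teacher is refreshed only every `update_period` steps.

    Assumption (not from the paper): the refresh is a plain copy of the student.
    """
    teacher = copy.deepcopy(student)  # frozen snapshot that would generate pseudo labels
    num_updates = 0
    for step in range(1, steps + 1):
        # The student would be trained on pseudo labels produced by the frozen teacher.
        train_step(student, teacher)
        if step % update_period == 0:
            # The pseudo-label distribution can only change at these points.
            teacher = copy.deepcopy(student)
            num_updates += 1
    return teacher, num_updates


# Toy usage: the "model" is a dict and each training step increments one weight.
student = {"w": 0.0}


def toy_step(s, t):
    s["w"] += 1.0


teacher, num_updates = train_with_periodic_teacher(
    student, steps=10, update_period=4, train_step=toy_step
)
```

In this toy run the teacher changes twice in ten steps, whereas a per-step update (for example, an exponential moving average) would change it ten times; fewer teacher changes means fewer shifts in the pseudo-label distribution.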