LP-OVOD: Open-Vocabulary Object Detection by Linear Probing

Chau Pham, Truong Vu, Khoi Nguyen; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 779-788

Abstract


This paper addresses the challenging problem of open-vocabulary object detection (OVOD) where an object detector must identify both seen and unseen classes in test images without labeled examples of the unseen classes in training. A typical approach for OVOD is to use joint text-image embeddings of CLIP to assign box proposals to their closest text label. However, this method has a critical issue: many low-quality boxes, such as over- and under-covered-object boxes, have the same similarity score as high-quality boxes since CLIP is not trained on exact object location information. To address this issue, we propose a novel method, LP-OVOD, that discards low-quality boxes by training a sigmoid linear classifier on pseudo labels retrieved from the top relevant region proposals to the novel text. Notably, LP-OVOD seamlessly integrates the knowledge distillation technique from ViLD, resulting in a new state-of-the-art OVOD approach. Experimental results on COCO affirm the superior performance of our approach over prior work, achieving 40.5 in AP_novel using ResNet50 as the backbone and without external datasets or knowing novel classes in training. Our code will be available at https://github.com/VinAIResearch/LP-OVOD.
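The linear-probing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: random vectors stand in for CLIP region and text embeddings, and the function names, `top_k`, learning rate, step count, and the choice to treat proposals retrieved for other classes as negatives are all assumptions made for illustration.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def retrieve_pseudo_labels(region_emb, text_emb, top_k):
    """For each class text embedding, return indices of the top_k most similar proposals."""
    sim = l2_normalize(region_emb) @ l2_normalize(text_emb).T  # (N, C)
    return np.argsort(-sim, axis=0)[:top_k]                    # (top_k, C)

def train_sigmoid_linear_probe(features, targets, lr=0.5, steps=200):
    """One-vs-rest sigmoid classifier trained with binary cross-entropy (plain gradient descent)."""
    n, d = features.shape
    w = np.zeros((d, targets.shape[1]))
    b = np.zeros(targets.shape[1])
    for _ in range(steps):
        probs = 1.0 / (1.0 + np.exp(-(features @ w + b)))
        grad = (probs - targets) / n          # gradient of mean BCE w.r.t. logits
        w -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return w, b

def build_probe(region_emb, text_emb, top_k=10):
    """Retrieve pseudo-labelled proposals per class, then fit the linear probe on them."""
    top_idx = retrieve_pseudo_labels(region_emb, text_emb, top_k)
    selected = np.unique(top_idx)             # proposals retrieved for any class
    feats = l2_normalize(region_emb[selected])
    targets = np.zeros((len(selected), text_emb.shape[0]))
    for c in range(text_emb.shape[0]):
        # Assumption: proposals retrieved for other classes act as negatives for class c.
        targets[np.isin(selected, top_idx[:, c]), c] = 1.0
    return train_sigmoid_linear_probe(feats, targets)

# Synthetic stand-in data (hypothetical): 3 novel classes, 200 proposals, 64-dim embeddings.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(3, 64))
labels = rng.integers(0, 3, size=200)
region_emb = text_emb[labels] + 0.5 * rng.normal(size=(200, 64))

w, b = build_probe(region_emb, text_emb, top_k=10)
scores = 1.0 / (1.0 + np.exp(-(l2_normalize(region_emb) @ w + b)))
print(scores.shape)
```

In this sketch the learned per-class sigmoid scores, rather than raw text-image similarity, would be used to rank proposals, which mirrors the abstract's motivation of separating high-quality boxes from low-quality ones that share similar CLIP similarity.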

Related Material


@InProceedings{Pham_2024_WACV,
    author    = {Pham, Chau and Vu, Truong and Nguyen, Khoi},
    title     = {LP-OVOD: Open-Vocabulary Object Detection by Linear Probing},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {779-788}
}