Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection

Xiaowei Zhao, Xianglong Liu, Duorui Wang, Yajun Gao, Zhide Liu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 16741-16750

Abstract


Open Vocabulary Object Detection (OVD) aims to detect objects from novel classes described by text inputs based on the generalization ability of trained classes. Existing methods mainly focus on transferring knowledge from large Vision and Language models (VLM) to detectors through knowledge distillation. However these approaches show weak ability in adapting to diverse classes and aligning between the image-level pre-training and region-level detection thereby impeding effective knowledge transfer. Motivated by the prompt tuning we propose scene-adaptive and region-aware multi-modal prompts to address these issues by effectively adapting class-aware knowledge from VLM to the detector at the region level. Specifically to enhance the adaptability to diverse classes we design a scene-adaptive prompt generator from a scene perspective to consider both the commonality and diversity of the class distributions and formulate a novel selection mechanism to facilitate the acquisition of common knowledge across all classes and specific insights relevant to each scene. Meanwhile to bridge the gap between the pre-trained model and the detector we present a region-aware multi-modal alignment module which employs the region prompt to incorporate the positional information for feature distillation and integrates textual prompts to align visual and linguistic representations. Extensive experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art models on the OV-COCO and OV-LVIS datasets surpassing the current method by 3.0% mAP and 4.6% APr .

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Zhao_2024_CVPR, author = {Zhao, Xiaowei and Liu, Xianglong and Wang, Duorui and Gao, Yajun and Liu, Zhide}, title = {Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {16741-16750} }