@InProceedings{Li_2025_ICCV,
    author    = {Li, Yanqi and Niu, Jianwei and Ren, Tao},
    title     = {Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {22110-22119}
}
Benefit From Seen: Enhancing Open-Vocabulary Object Detection by Bridging Visual and Textual Co-Occurrence Knowledge
Abstract
Open-Vocabulary Object Detection (OVOD) aims to localize and recognize objects from both known and novel categories. However, existing methods rely heavily on internal knowledge from Vision-Language Models (VLMs), restricting their generalization to unseen categories due to limited contextual understanding. To address this, we propose CODet, a plug-and-play framework that enhances OVOD by integrating object co-occurrence, a form of external contextual knowledge pervasive in real-world scenes. Specifically, CODet extracts visual co-occurrence patterns from images, aligns them with textual dependencies validated by Large Language Models (LLMs), and injects contextual co-occurrence pseudo-labels as external knowledge to guide detection. Without architectural changes, CODet consistently improves five state-of-the-art VLM-based detectors across two benchmarks, achieving notable gains (up to +2.3 AP on novel categories). Analyses further confirm its ability to encode meaningful contextual guidance, advancing open-world perception by bridging visual and textual co-occurrence knowledge.
