Reconstructing CLIP for Open-Vocabulary Dense Perception

Yajie Liu, Jinjin Zhang, Qingjie Liu, Di Huang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 39208-39218

Abstract


Large-scale vision-language models (VLMs) such as CLIP excel at zero-shot image classification, yet they struggle to achieve the dense cross-modal alignment required by open-vocabulary dense perception (OVDP). While recent self-distillation methods address this by aligning dense features with generalizable global semantics, a key question remains: how should such dense features be constructed to achieve optimal alignment? To answer it, we propose DenseRC, a principled Dense Representations Construction framework that reconstructs CLIP for OVDP based on two key insights. First, by analyzing the internal semantics encoded in the global cls token, we identify multi-layer value embeddings as an informative basis for dense features. Second, we reveal that spatial aggregation tends to amplify semantic misalignment. Motivated by these findings, we design a lightweight Head-Selective Gating (HSG) module that adaptively reweights feature heads according to their intrinsic heterogeneity, enabling the construction of discriminative and alignment-friendly dense representations. Extensive experiments demonstrate that DenseRC delivers consistent and substantial gains across OVDP tasks, including object detection and semantic segmentation, setting new state-of-the-art performance on multiple benchmarks.
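
The abstract does not specify how the HSG module computes its head weights, so the following is only an illustrative sketch of per-head gating in general, not the authors' implementation. It assumes each token carries one value embedding per attention head, and it assumes a sigmoid gate computed from that token's own head embedding, so no mixing across spatial positions takes place. The function name `head_selective_gate` and the parameters `gate_weights` and `gate_bias` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def head_selective_gate(values, gate_weights, gate_bias):
    """Reweight per-head value embeddings with a per-token, per-head gate.

    values:       H heads x N tokens x D channels (nested lists of floats).
    gate_weights: H vectors of length D (hypothetical learned parameters).
    gate_bias:    H floats (hypothetical learned parameters).
    Returns (dense, gates): dense is N x (H*D), gates is N x H.
    """
    num_heads = len(values)
    num_tokens = len(values[0])
    dense, gates = [], []
    for n in range(num_tokens):
        row, token_gates = [], []
        for h in range(num_heads):
            v = values[h][n]
            # Gate depends only on this token's embedding in this head,
            # so there is no aggregation over spatial positions.
            score = sum(w * x for w, x in zip(gate_weights[h], v)) + gate_bias[h]
            g = sigmoid(score)
            token_gates.append(g)
            # Scale the head's channels and concatenate heads per token.
            row.extend(g * x for x in v)
        dense.append(row)
        gates.append(token_gates)
    return dense, gates


# Toy input: 2 heads, 2 tokens, 2 channels per head.
values = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0], [7.0, 8.0]],
]
zero_w = [[0.0, 0.0], [0.0, 0.0]]
zero_b = [0.0, 0.0]
dense, gates = head_selective_gate(values, zero_w, zero_b)
```

With all-zero gate parameters every gate equals 0.5, so each head is simply halved; non-zero parameters let individual heads be emphasized or suppressed per token.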

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Liu_2026_CVPR,
    author    = {Liu, Yajie and Zhang, Jinjin and Liu, Qingjie and Huang, Di},
    title     = {Reconstructing CLIP for Open-Vocabulary Dense Perception},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39208-39218}
}