@InProceedings{Zohra_2026_CVPR,
    author    = {Zohra, Fatimah and Zhao, Chen and Itani, Hani and Ghanem, Bernard},
    title     = {b-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {680-689}
}
b-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment
Abstract
CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose b-CLIP, a multi-granular, text-conditioned contrastive learning framework that hierarchically aligns multiple textual granularities, from full captions to sentences and phrases, with their corresponding visual regions. For each level of granularity, b-CLIP uses cross-attention to dynamically pool image patches into contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the b-Contextualized Contrastive Alignment Loss (b-CAL), which parameterizes the trade-off between query-specific matching and intra-image contextualization under both Cross-Entropy and Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that b-CLIP significantly improves dense alignment, achieving 91.8% T2I and 92.3% I2T R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting a new state of the art among methods trained without hard negatives. b-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.
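To make the idea of text-conditioned pooling concrete, the following is a minimal sketch of how a text query could weight image patches through attention before they are averaged into a single visual embedding. It is not the authors' implementation: the function names, the single attention head, the absence of learned projections, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def text_conditioned_pool(text_query, patches):
    """Pool patch embeddings into one visual embedding.

    Each patch is weighted by its scaled dot-product similarity to the
    text query. A single head with no learned projections is a
    simplifying assumption, not a detail taken from the paper.
    """
    scale = 1.0 / math.sqrt(len(text_query))
    scores = [
        scale * sum(q * p for q, p in zip(text_query, patch))
        for patch in patches
    ]
    weights = softmax(scores)
    dim = len(patches[0])
    pooled = [
        sum(w * patch[d] for w, patch in zip(weights, patches))
        for d in range(dim)
    ]
    return pooled, weights


# Toy data (hypothetical): three 2-D patch embeddings and two text queries
# standing in for a full caption and a short phrase.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_query = [1.0, 1.0]
phrase_query = [4.0, 0.0]

caption_embedding, caption_weights = text_conditioned_pool(caption_query, patches)
phrase_embedding, phrase_weights = text_conditioned_pool(phrase_query, patches)
```

In this toy setting the same patches yield different pooled embeddings depending on the query: the phrase query concentrates weight on the patches that share its direction, which is the behavior the abstract describes at each level of granularity.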

