b-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment

Zohra, Fatimah; Zhao, Chen; Itani, Hani; Ghanem, Bernard

Fatimah Zohra, Chen Zhao, Hani Itani, Bernard Ghanem; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 680-689

Abstract

CLIP achieves strong zero-shot image-text retrieval by aligning global vision and text representations, yet it falls behind on fine-grained tasks even when fine-tuned on long, detailed captions. In this work, we propose b-CLIP, a multi-granular text-conditioned contrastive learning framework that hierarchically aligns multiple textual granularities (full captions, sentences, and phrases) with their corresponding visual regions. For each level of granularity, b-CLIP uses cross-attention to dynamically pool image patches into contextualized visual embeddings. To address the semantic overlap inherent in this hierarchy, we introduce the b-Contextualized Contrastive Alignment Loss (b-CAL), which parameterizes the trade-off between query-specific matching and intra-image contextualization under both Cross-Entropy and Binary Cross-Entropy formulations. Through extensive experiments, we demonstrate that b-CLIP significantly improves dense alignment, achieving 91.8% T2I and 92.3% I2T at R@1 on Urban1K and 30.9% on FG-OVD (Hard), setting a new state of the art among methods trained without hard negatives. b-CLIP establishes a robust, adaptive baseline for dense vision-language correspondence. The code and models are released at https://github.com/fzohra/B-CLIP.
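To make the text-conditioned pooling idea concrete, the following is a minimal sketch, not the released implementation. It assumes a single attention head, no learned projection matrices, and toy 2-dimensional embeddings; the function name `text_conditioned_pool` and all values are hypothetical. It shows only how one text query (a caption, sentence, or phrase embedding) can weight image patches through scaled dot-product attention to produce a query-specific visual embedding.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def text_conditioned_pool(query, patches):
    """Pool patch embeddings into one visual embedding.

    Hypothetical helper: weights come from scaled dot-product attention
    between the text query and each patch. A real model would add learned
    query/key/value projections and multiple heads (assumption, not shown).
    """
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, p) * scale for p in patches])
    dim = len(patches[0])
    pooled = [sum(w * p[i] for w, p in zip(weights, patches)) for i in range(dim)]
    return pooled, weights


# Toy data: three image patches and two text queries of different granularity.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
caption_query = [1.0, 1.0]  # stands in for a full-caption embedding
phrase_query = [4.0, 0.0]   # stands in for a short-phrase embedding

caption_vec, caption_w = text_conditioned_pool(caption_query, patches)
phrase_vec, phrase_w = text_conditioned_pool(phrase_query, patches)
```

In this toy setup the same image yields a different pooled embedding for each query: the caption query puts the most weight on the third patch, while the phrase query largely ignores the second patch. Per-query pooling of this kind is what lets one image be contrasted against several textual granularities.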

Related Material

@InProceedings{Zohra_2026_CVPR,
    author    = {Zohra, Fatimah and Zhao, Chen and Itani, Hani and Ghanem, Bernard},
    title     = {b-CLIP: Text-Conditioned Contrastive Learning for Multi-Granular Vision-Language Alignment},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {680-689}
}