[bibtex]
@InProceedings{Singh_2025_WACV,
    author    = {Singh, Jaisidh and Shrivastava, Ishaan and Vatsa, Mayank and Singh, Richa and Bharati, Aparna},
    title     = {Learning the Power of ``No'': Foundation Models with Negations},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {7991-8001}
}
Learning the Power of "No": Foundation Models with Negations
Abstract
Negation is a fundamental aspect of natural language reasoning, yet foundational vision-language models (VLMs) like CLIP face significant challenges in accurately interpreting it. These models often process text prompts holistically, making it difficult to isolate and understand the role of negated terms. To overcome this limitation, we present CC-Neg: a novel dataset consisting of 228,246 images, each paired with both a true caption and its corresponding negated version. CC-Neg provides a critical benchmark to assess and improve foundational VLMs' ability to process negations, focusing specifically on how the presence of terms like "not" alters the semantic relationship between images and their textual descriptions. To illustrate the effectiveness of the CC-Neg dataset in enhancing negation understanding, we introduce the CoN-CLIP framework, which incorporates targeted modifications to CLIP's contrastive loss function. When trained with CC-Neg, CoN-CLIP achieves a 3.85% average improvement in top-1 accuracy for zero-shot image classification across eight datasets and a 4.4% performance gain on challenging compositionality benchmarks such as SugarCREPE. These results highlight CoN-CLIP's enhanced understanding of the nuanced semantic relationships involving negation. Our code and the CC-Neg benchmark are available at: https://github.com/jaisidhsingh/CoN-CLIP.
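To make the idea of "targeted modifications to CLIP's contrastive loss" concrete, here is a minimal, hedged sketch (not the authors' implementation — the function names and the exact loss form are assumptions): one way to use image-negated-caption pairs is to add each image's negated caption to the softmax denominator as an extra hard negative, so the model is pushed to score the true caption above both the other captions in the batch and the negation of its own caption.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors (plain lists)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def loss_for_image(img_emb, true_caps, neg_cap, i, temperature=0.07):
    """Softmax cross-entropy for image i: candidates are every true caption
    in the batch PLUS image i's own negated caption as a hard negative.
    The correct class is true_caps[i]. Illustrative only, not CoN-CLIP's code."""
    logits = [cosine(img_emb, c) / temperature for c in true_caps]
    logits.append(cosine(img_emb, neg_cap) / temperature)  # negation as distractor
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(z - m) for z in logits)
    return -(logits[i] - m - math.log(denom))
```

Intuitively, when the negated caption embeds far from the image, the loss is already small; when it embeds close to the image (i.e., the model has not learned the negation), the extra logit inflates the denominator and the loss grows, producing a gradient that separates the image from its negated caption.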