CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision-Language Models

Chatterjee, Abhiroop; Ghosh, Susmita; Ghosh, Ashish; Ientilucci, Emmett

Abhiroop Chatterjee, Susmita Ghosh, Ashish Ghosh, Emmett Ientilucci; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 31566-31576

Abstract

Recent advances in vision-language models (VLMs) have revealed both the promise and the rigidity of large-scale pretraining. Despite the impressive zero-shot generalization of VLMs, existing adaptation paradigms (prompt tuning, adapter injection, and fine-tuning) remain class-specific, modality-biased, and structure-agnostic. These design choices limit reasoning-level transfer across tasks. We therefore rethink adaptation as learning a shared conceptual structure rather than a per-class specialization. We propose CASPA (Concept-Anchored Semantic Prompt Adapter), a dual-anchor semantic adapter that jointly learns shared text and image anchors as a bidirectional conceptual interface between modalities. Each class learns a soft association distribution over these anchors, producing compositional representations that enable parameter sharing and semantic reuse. To further align visual and textual reasoning spaces, CASPA employs Semantic Cross-Consistency Regularization (S-XCR), which enforces geometric and semantic agreement between text- and image-conditioned anchor mixtures. CASPA thus provides a structurally constrained alternative to class-conditional prompt parameterization while keeping the CLIP backbone frozen. Evaluated on eleven diverse visual recognition datasets across four regimes (Base-to-Novel, cross-data transfer, few-shot, and backbone-agnostic evaluation), CASPA matches or outperforms state-of-the-art methods.
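
The anchor-mixture idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the softmax association, the single distribution shared across both anchor sets, the cosine-based agreement penalty, and all names (`class_representations`, `consistency_penalty`, `assoc_logits`) are assumptions made for illustration. The frozen CLIP backbone and the graph structure named in the title are not modeled; anchors are random vectors.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix(weights, anchors):
    """Weighted sum of anchor vectors (one compositional representation)."""
    dim = len(anchors[0])
    return [sum(w * a[d] for w, a in zip(weights, anchors)) for d in range(dim)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv + 1e-12)


def class_representations(assoc_logits, text_anchors, image_anchors):
    """For each class, mix the shared anchors with that class's soft association.

    Assumption: one association distribution per class is reused for both
    the text anchors and the image anchors.
    """
    reps = []
    for logits in assoc_logits:
        w = softmax(logits)
        reps.append((mix(w, text_anchors), mix(w, image_anchors)))
    return reps


def consistency_penalty(reps):
    """Hypothetical stand-in for a cross-consistency term:
    mean (1 - cosine) between text- and image-side mixtures."""
    return sum(1.0 - cosine(t, v) for t, v in reps) / len(reps)


random.seed(0)
num_classes, num_anchors, dim = 3, 4, 8
text_anchors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_anchors)]
image_anchors = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_anchors)]
assoc_logits = [[random.gauss(0, 1) for _ in range(num_anchors)] for _ in range(num_classes)]

reps = class_representations(assoc_logits, text_anchors, image_anchors)
penalty = consistency_penalty(reps)
```

In this sketch the learnable quantities are the two anchor sets and the per-class association logits, so the parameter count grows with the number of anchors rather than with a separate prompt per class, which is the parameter sharing the abstract refers to.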

Related Material

[bibtex]

@InProceedings{Chatterjee_2026_CVPR,
    author    = {Chatterjee, Abhiroop and Ghosh, Susmita and Ghosh, Ashish and Ientilucci, Emmett},
    title     = {CASPA: Graph-Structured Concept Anchors for Modality-Agnostic Adaptation in Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {31566-31576}
}