@InProceedings{Chen_2025_CVPR,
  author    = {Chen, Long and Chen, Yuling and Luo, Yun and Dou, Hui and Zhong, Xinyang},
  title     = {Attention-Guided Hierarchical Defense for Multimodal Attacks in Vision-Language Models},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {1607-1617}
}
Attention-Guided Hierarchical Defense for Multimodal Attacks in Vision-Language Models
Abstract
Pretrained Vision-Language Models (VLMs) such as CLIP demonstrate impressive zero-shot classification capabilities through cross-modal alignment. However, these models remain vulnerable to adversarial attacks. Existing defense research primarily focuses on image attacks, neglecting text attacks and multimodal collaborative attacks. Through experimental analysis, we observe two phenomena in adversarial samples: cross-modal attention shift and intra-modal self-attention distortion. Based on these observations, we propose an Attention-Guided Hierarchical Multimodal Adversarial Training (AGH-MAT) framework comprising two modules. The first is the Bidirectional Attention Alignment (BAA) module: building on text-guided image attention, we design image-guided text attention. BAA aligns the attention distributions of adversarial and clean samples in the target model, while the original model constrains the target model to preserve generalization, effectively mitigating cross-modal semantic shift. The second is the Hierarchical Self-Attention Fusion (HSAF) module, which employs a hierarchical feature integration strategy: it merges shallow geometric features with deep semantic representations into a spatially consistent cross-layer attention expression, counteracting intra-modal attention distortion. Experiments on 15 datasets demonstrate that our method achieves superior performance in both in-distribution and zero-shot scenarios, providing a reliable defense paradigm for trustworthy multimodal systems.
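The abstract's core idea of aligning the attention distributions of adversarial and clean samples can be illustrated with a minimal NumPy sketch. This is not the paper's implementation; the function names are hypothetical, and a KL divergence between softmax-normalized attention maps is assumed as one plausible alignment objective.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax over attention logits
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_alignment_loss(clean_logits, adv_logits, eps=1e-8):
    """KL(clean || adversarial) between attention distributions.

    A hypothetical alignment term: driving it to zero pulls the
    adversarial sample's attention map toward the clean sample's.
    """
    p = softmax(np.asarray(clean_logits, dtype=np.float64))
    q = softmax(np.asarray(adv_logits, dtype=np.float64))
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())

# toy check: identical maps incur ~zero loss, shifted maps a positive one
clean = np.array([[1.0, 2.0, 0.5], [0.2, 0.1, 3.0]])
shifted = clean + np.array([[2.0, -1.0, 0.0], [0.0, 1.5, -2.0]])
print(attention_alignment_loss(clean, clean))    # near 0
print(attention_alignment_loss(clean, shifted))  # > 0
```

In a full training loop this term would be computed per attention layer, in both directions (text-guided image attention and image-guided text attention), and added to the standard adversarial training loss.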