SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers

Faridoun Mehri, Mohsen Fayyaz, Mahdieh Soleymani Baghshah, Mohammad Taher Pilehvar; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 204-215

Abstract


Despite their remarkable performance, the explainability of Vision Transformers (ViTs) remains a challenge. While forward attention-based token attribution techniques have become popular in text processing, their suitability for ViTs has not been extensively explored. In this paper, we compare these methods against state-of-the-art input attribution methods from the vision literature, revealing their limitations due to improper aggregation of information across layers. To address this, we introduce two general techniques, PLUS and SkipPLUS, that can be composed with any input attribution method to more effectively aggregate information across layers while handling noisy layers. Through comprehensive and quantitative evaluations of faithfulness and human interpretability on a variety of ViT architectures and datasets, we demonstrate the effectiveness of PLUS and SkipPLUS, establishing a new state of the art in white-box token attribution. We conclude with a comparative analysis highlighting the strengths and weaknesses of the best versions of all the studied methods. The code used in this paper is freely available at https://github.com/NightMachinery/SkipPLUS-CVPR-2024.
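
To give a concrete picture of the general idea described in the abstract, the following is a minimal, hypothetical sketch, not the paper's actual PLUS or SkipPLUS formulations (those are available in the linked repository). It assumes that some attribution method has already produced one token-attribution map per transformer layer, and shows how such maps might be aggregated while skipping the first few, potentially noisy, layers. The function name, the mean aggregator, and the (num_layers, num_tokens) input format are illustrative assumptions only.

    # Illustrative sketch only: not the authors' exact method.
    # Assumes per-layer token attributions are already computed by some
    # attribution technique; shows the generic "skip early layers, then
    # aggregate across the remaining layers" idea.
    import numpy as np


    def aggregate_skipping_early_layers(layer_maps: np.ndarray, skip: int = 2) -> np.ndarray:
        """Aggregate per-layer token attributions, ignoring the first `skip` layers.

        layer_maps: array of shape (num_layers, num_tokens), one attribution
            score per token per layer (hypothetical input format).
        skip: number of initial layers to discard before aggregation.
        """
        if not 0 <= skip < layer_maps.shape[0]:
            raise ValueError("`skip` must leave at least one layer to aggregate")
        kept = layer_maps[skip:]               # drop the first `skip` layers
        aggregated = kept.mean(axis=0)         # simple mean as a stand-in aggregator
        # Normalize to [0, 1] so maps from different settings are comparable.
        aggregated -= aggregated.min()
        denom = aggregated.max()
        return aggregated / denom if denom > 0 else aggregated


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        maps = rng.random((12, 197))           # e.g., 12 layers, 196 patches + CLS token
        print(aggregate_skipping_early_layers(maps, skip=3).shape)  # (197,)

The choice of aggregator (mean, product, rollout-style composition, etc.) and of how many early layers to skip are exactly the design dimensions the paper evaluates; the sketch above fixes arbitrary choices purely for illustration.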

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Mehri_2024_CVPR,
    author    = {Mehri, Faridoun and Fayyaz, Mohsen and Baghshah, Mahdieh Soleymani and Pilehvar, Mohammad Taher},
    title     = {SkipPLUS: Skip the First Few Layers to Better Explain Vision Transformers},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {204-215}
}