Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods

Mingqi Jiang, Saeed Khorram, Li Fuxin; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9546-9555

Abstract


In order to gain insights about the decision-making of different visual recognition backbones we propose two methodologies sub-explanation counting and cross-testing that systematically applies deep explanation algorithms on a dataset-wide basis and compares the statistics generated from the amount and nature of the explanations. These methodologies reveal the difference among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional in the sense that they jointly consider multiple parts of the image in building their decisions whereas traditional CNNs and distilled transformers are less compositional and more disjunctive which means that they use multiple diverse but smaller set of parts to achieve a confident prediction. Through further experiments we pinpointed the choice of normalization to be especially important in the compositionality of a model in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Jiang_2024_CVPR, author = {Jiang, Mingqi and Khorram, Saeed and Fuxin, Li}, title = {Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {9546-9555} }