Visualizing Paired Image Similarity in Transformer Networks

Samuel Black, Abby Stylianou, Robert Pless, Richard Souvenir; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 3164-3173

Abstract

Transformer architectures have shown promise for a wide range of computer vision tasks, including image embedding. As with convolutional neural networks and other models, explainability of predictions is a key concern, but visualization approaches tend to be architecture-specific. In this paper, we introduce a new method for producing interpretable visualizations that, given a pair of images encoded with a Transformer, show which regions contributed to their similarity. Additionally, for the task of image retrieval, we compare the performance of Transformer and ResNet models of similar capacity and show that, while they perform similarly in aggregate, the retrieved results and the visual explanations for those results are quite different. Code is available at https://github.com/vidarlab/xformer-paired-viz.
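The paper and the linked repository give the actual visualization method; as a rough intuition only, the sketch below shows one simple way a paired-similarity heatmap can be formed from ViT patch tokens. This is an assumption-laden illustration, not the authors' technique: it assumes patch-token embeddings are already extracted (stand-in random tensors here), scores each patch of one image by its best cosine match in the other (max-pooling is one choice; averaging is another), and upsamples to image resolution. The names paired_similarity_map, tok_a, and tok_b are hypothetical.

import torch
import torch.nn.functional as F

def paired_similarity_map(tokens_a, tokens_b, grid=(14, 14)):
    """Spatial map over image A showing which regions drive its
    similarity to image B.

    tokens_a, tokens_b: (N, D) patch-token embeddings from a ViT
    (CLS token removed), e.g. N = 14*14 for a 224px, patch-16 model.
    """
    a = F.normalize(tokens_a, dim=-1)          # (N, D) unit-norm patch embeddings
    b = F.normalize(tokens_b, dim=-1)          # (M, D)
    sim = a @ b.T                              # (N, M) patch-to-patch cosine similarities
    # Credit each patch of A with its strongest match in B (one possible pooling).
    per_patch = sim.max(dim=1).values          # (N,)
    heat = per_patch.reshape(1, 1, *grid)      # back onto the patch grid
    # Upsample to image resolution so the map can be overlaid on the input.
    return F.interpolate(heat, size=(224, 224), mode="bilinear",
                         align_corners=False).squeeze()

# Stand-in tokens; in practice these would come from a ViT's final block.
tok_a, tok_b = torch.randn(196, 768), torch.randn(196, 768)
heatmap = paired_similarity_map(tok_a, tok_b)  # (224, 224), ready to overlay

Note that raw patch-to-patch matching ignores how attention mixes information across tokens; the paper's contribution is precisely a principled decomposition for Transformer encoders, so consult the repository for the real implementation.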

Related Material

[pdf]
[bibtex]
@InProceedings{Black_2022_WACV,
    author    = {Black, Samuel and Stylianou, Abby and Pless, Robert and Souvenir, Richard},
    title     = {Visualizing Paired Image Similarity in Transformer Networks},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2022},
    pages     = {3164-3173}
}