SWAG-V: Explanations for Video Using Superpixels Weighted by Average Gradients

Thomas Hartley, Kirill Sidorov, Christopher Willis, David Marshall; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 604-613

Abstract


CNN architectures that take videos as an input are often overlooked when it comes to the development of explanation techniques. This is despite their use in critical domains such as surveillance and healthcare. Explanation techniques developed for these networks must take into account the additional temporal domain if they are to be successful. In this paper we introduce SWAG-V, an extension of SWAG for use with networks that take video as an input. By creating superpixels that incorporate individual frames of the input video, we are able to create explanations that better locate regions of the input that are important to the network's prediction. We demonstrate, using Kinetics-400 with both the C3D and R(2+1)D network architectures, that SWAG-V outperforms Grad-CAM, Grad-CAM++ and Saliency Tubes over a range of common metrics such as explanation accuracy and localisation.
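
The idea named in the title, superpixels weighted by average gradients, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a superpixel labelling spanning the video frames and a per-pixel gradient volume have already been computed elsewhere, and the function name, the nested-list layout [frames][rows][cols] and the use of absolute gradient values are all assumptions made for illustration.

```python
from collections import defaultdict


def weight_segments_by_average_gradient(labels, gradients):
    """Score each superpixel by the mean absolute gradient of its pixels.

    labels and gradients are nested lists shaped [frames][rows][cols]
    (a hypothetical layout). A label may span several frames.
    Returns (means, heatmap): means maps label -> score, and heatmap has
    the same shape as labels with every pixel set to its segment's score.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for label_frame, grad_frame in zip(labels, gradients):
        for label_row, grad_row in zip(label_frame, grad_frame):
            for label, grad in zip(label_row, grad_row):
                # Assumption: gradient magnitude, not sign, is what matters.
                totals[label] += abs(grad)
                counts[label] += 1
    means = {label: totals[label] / counts[label] for label in totals}
    # Paint each pixel with the score of the segment it belongs to.
    heatmap = [[[means[label] for label in row] for row in frame]
               for frame in labels]
    return means, heatmap


# Toy input: two 2x2 frames, two segments that both span the frames.
labels = [[[0, 0], [1, 1]],
          [[0, 1], [1, 1]]]
gradients = [[[1.0, -3.0], [0.5, 0.5]],
             [[2.0, 1.0], [1.0, 1.0]]]
means, heatmap = weight_segments_by_average_gradient(labels, gradients)
print(means)
```

In this toy input, segment 0 covers three pixels with absolute gradients 1, 3 and 2, so it scores higher than segment 1, whose five pixels have smaller gradients. Because the labels are shared across frames, one score is assigned to a region over time rather than to each frame separately.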

Related Material


[bibtex]
@InProceedings{Hartley_2022_WACV,
    author    = {Hartley, Thomas and Sidorov, Kirill and Willis, Christopher and Marshall, David},
    title     = {SWAG-V: Explanations for Video Using Superpixels Weighted by Average Gradients},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2022},
    pages     = {604-613}
}