Importance-Guided Interpretability and Pruning for Video Transformers in Driver Action Recognition

Raquel Panadero Palenzuela, Dominik Schörkhuber, Margrit Gelautz; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5295-5304

Abstract


Recently, transformers have gained prominence in video action recognition due to their ability to capture spatio-temporal dependencies. Despite their effectiveness, the interpretability of their self-attention mechanisms remains limited, posing obstacles to understanding model decisions and hampering transparency and bias identification. Additionally, the computational demands of transformer architectures, particularly of the self-attention mechanism, present practical difficulties. To tackle both challenges, we adapt existing interpretability techniques and introduce a layer pruning method guided by importance metrics. In the context of driver action recognition, our findings highlight the efficacy of the applied head importance metrics in pinpointing crucial attention heads and identifying the key visual cues essential for recognizing driver behavior. Experimental results on three mainstream video transformers demonstrate the effectiveness of the proposed pruning technique: removing low-relevance layers significantly reduces computational costs with only slight performance degradation. Specifically, on our DriverActionInsight (DAI) dataset we achieve a 23.5% FLOPs saving when compressing Video Swin, with less than a 1% decrease in Top-1 accuracy.
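
The abstract does not spell out the importance metric used, so the following is only a minimal sketch of importance-guided layer pruning under an assumed proxy: a first-order Taylor (gradient-based) score per layer. Everything here, including the `GatedEncoder` module and the toy data, is illustrative and not taken from the paper, whose method is evaluated on full video transformers such as Video Swin.

```python
# Sketch of importance-guided layer pruning (assumed gradient proxy,
# not the paper's exact metric). Hypothetical names throughout.
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Stack of transformer layers, each scaled by a gate fixed at 1.
    The gradient of the loss w.r.t. each gate approximates how much
    the loss would change if that layer's contribution were removed."""
    def __init__(self, d_model=64, nhead=4, num_layers=6, num_classes=10):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(num_layers)
        )
        # One gate per layer; a Parameter so gate.grad is populated.
        self.gates = nn.Parameter(torch.ones(num_layers))
        self.head = nn.Linear(d_model, num_classes)

    def forward(self, x):
        for gate, layer in zip(self.gates, self.layers):
            # Residual-style gating: gate == 1 reproduces the plain layer.
            x = x + gate * (layer(x) - x)
        return self.head(x.mean(dim=1))  # pool over tokens

def layer_importance(model, loader, criterion):
    """Accumulate |dL/d gate| over a few batches as an importance score."""
    scores = torch.zeros_like(model.gates)
    for x, y in loader:
        model.zero_grad()
        criterion(model(x), y).backward()
        scores += model.gates.grad.abs()
    return scores

# Toy usage: random "video token" batches stand in for a real dataset.
model = GatedEncoder()
loader = [(torch.randn(8, 16, 64), torch.randint(0, 10, (8,)))
          for _ in range(4)]
scores = layer_importance(model, loader, nn.CrossEntropyLoss())
# Keep the 4 highest-scoring layers, preserving their original order.
keep = scores.argsort(descending=True)[:4].sort().values.tolist()
model.layers = nn.ModuleList(model.layers[i] for i in keep)
model.gates = nn.Parameter(torch.ones(len(keep)))
print("kept layers:", keep)
```

Since a gate of 1 leaves each layer unchanged, the gradient magnitude estimates the loss's sensitivity to that layer's contribution, and dropping the lowest-scoring layers mirrors the paper's removal of low-relevance layers; in practice one would fine-tune the pruned model afterwards.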

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Palenzuela_2025_WACV,
    author    = {Palenzuela, Raquel Panadero and Sch\"orkhuber, Dominik and Gelautz, Margrit},
    title     = {Importance-Guided Interpretability and Pruning for Video Transformers in Driver Action Recognition},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5295-5304}
}