A Transformer-Based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos

Jannik Koch, Stefan Wolf, Jürgen Beyerer; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2023, pp. 100-109

Abstract


Fine-grained image classification is limited by considering only a single view, while in many cases, such as surveillance, a whole video exists that provides multiple perspectives. However, the potential of videos is mostly explored in the context of action recognition, and fine-grained object recognition is rarely treated as an application of video classification. As a result, recent video classification architectures are poorly suited to fine-grained object recognition. We propose a novel, Transformer-based late-fusion mechanism for fine-grained video classification. Our approach achieves superior results to both early-fusion mechanisms, like the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. Additionally, we achieve improved efficiency, as our results show a large increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.
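To make the contrast between the two late-fusion styles named in the abstract concrete, the following is a minimal, dependency-free sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names `consensus_fusion` and `self_attention_fusion` are hypothetical, the per-frame logits and features are assumed to come from some image backbone, and the attention step uses identity query/key/value projections where a trained Transformer would use learned ones.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_fusion(frame_logits):
    """Consensus-style late fusion: average per-frame class probabilities.

    frame_logits: one list of class logits per frame (assumed backbone output).
    """
    probs = [softmax(logits) for logits in frame_logits]
    n_frames = len(probs)
    n_classes = len(probs[0])
    return [sum(p[c] for p in probs) / n_frames for c in range(n_classes)]


def self_attention_fusion(frame_feats):
    """Toy single-head self-attention over per-frame features, then mean pooling.

    Assumption: identity Q/K/V projections. A real Transformer layer learns
    these projections and adds feed-forward blocks and normalization.
    """
    dim = len(frame_feats[0])
    scale = math.sqrt(dim)
    attended = []
    for query in frame_feats:
        # Scaled dot-product scores of this frame against every frame.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in frame_feats]
        weights = softmax(scores)
        # Weighted sum of frame features (values).
        attended.append([
            sum(w * feat[i] for w, feat in zip(weights, frame_feats))
            for i in range(dim)
        ])
    n_frames = len(attended)
    # Pool the attended frame features into one video-level vector.
    return [sum(vec[i] for vec in attended) / n_frames for i in range(dim)]


if __name__ == "__main__":
    # Three frames, three classes: two frames favor class 0, one favors class 1.
    logits = [[2.0, 0.5, 0.1], [0.3, 1.8, 0.2], [2.2, 0.4, 0.0]]
    fused = consensus_fusion(logits)
    print("consensus prediction:", fused.index(max(fused)))

    feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print("fused feature:", self_attention_fusion(feats))
```

In the consensus variant every frame contributes equally to the final prediction, whereas the attention variant lets frames weight each other before pooling; in a full model the pooled vector would then feed a classification head.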

Related Material


@InProceedings{Koch_2023_WACV,
    author    = {Koch, Jannik and Wolf, Stefan and Beyerer, J\"urgen},
    title     = {A Transformer-Based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2023},
    pages     = {100-109}
}