@InProceedings{Koch_2023_WACV,
    author    = {Koch, Jannik and Wolf, Stefan and Beyerer, J\"urgen},
    title     = {A Transformer-Based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2023},
    pages     = {100-109}
}
A Transformer-Based Late-Fusion Mechanism for Fine-Grained Object Recognition in Videos
Abstract
Fine-grained image classification is limited to a single view, although in many settings, such as surveillance, an entire video is available that shows the object from multiple perspectives. However, videos are mostly studied in the context of action recognition, and fine-grained object recognition is rarely treated as a video classification task. As a result, recent video classification architectures are poorly suited to fine-grained object recognition. We propose a novel Transformer-based late-fusion mechanism for fine-grained video classification. Our approach outperforms both early-fusion mechanisms, such as the Video Swin Transformer, and a simple consensus-based late-fusion baseline with a modern Swin Transformer backbone. It is also more efficient: our results show a large increase in accuracy with only a slight increase in computational complexity. Code is available at: https://github.com/wolfstefan/tlf.
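To make the term "consensus-based late fusion" concrete, the following is a minimal sketch of one common form of it: each frame is classified independently, and the per-frame class probabilities are averaged before the final decision. This is an illustration under stated assumptions, not the authors' implementation; the function names and the example logits are hypothetical, and the baseline in the paper may combine frames differently. The linked repository holds the actual code.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def consensus_late_fusion(frame_logits: Sequence[Sequence[float]]) -> int:
    """Average per-frame class probabilities and return the winning class index.

    Assumption: every frame was scored by the same image classifier, so all
    rows have the same number of classes.
    """
    if not frame_logits:
        raise ValueError("need at least one frame")
    num_classes = len(frame_logits[0])
    mean_probs = [0.0] * num_classes
    for logits in frame_logits:
        for class_index, prob in enumerate(softmax(logits)):
            mean_probs[class_index] += prob / len(frame_logits)
    return max(range(num_classes), key=mean_probs.__getitem__)


# Hypothetical logits for three frames and three classes.
# The first frame favors class 0, the other two favor class 1.
frames = [
    [2.0, 1.0, 0.1],
    [0.5, 2.5, 0.3],
    [0.2, 3.0, 0.1],
]
predicted = consensus_late_fusion(frames)
print(predicted)
```

Averaging treats every frame as equally informative. A learned late-fusion module, such as the Transformer-based one the abstract proposes, instead learns how to combine the per-frame information.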
Related Material