Multiple Instance Triplet Loss for Weakly Supervised Multi-Label Action Localisation of Interacting Persons

Sovan Biswas, Jürgen Gall; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 2159-2167

Abstract


With the abundance of videos and the high cost of data annotation, weakly supervised action localisation has gained more attention. However, most work on weakly supervised action localisation focuses on settings with a single action and a single person. Recently, new approaches have been proposed to extend the task towards multi-label scenarios where multiple persons can interact with each other and perform multiple actions at the same time. For longer videos, these methods subdivide the training videos into very short clips and discard the temporal consistency of actions across these short clips. In this work, we address this issue and propose the Multiple Instance Triplet Loss (MITL), which requires that consistent instances that are temporally close be more similar than distant and inconsistent instances. It is an extension of the triplet loss to bags, where a bag comprises all person detections at a keyframe. We evaluate our proposed approach on the challenging AVA dataset, where it achieves state-of-the-art results when the weakly labelled training videos are longer than 1 second.
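
To make the idea of a triplet loss over bags concrete, the following is a minimal illustrative sketch and not the formulation from the paper. It assumes that each bag is a list of feature vectors (one per person detection at a keyframe), that instances are compared with Euclidean distance, that the distance between two bags is the smallest distance between any pair of their instances, and that a standard hinge with a margin is used. The function names, the minimum aggregation, and the margin value are all hypothetical choices made for illustration.

```python
import math


def instance_distance(a, b):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def bag_distance(bag_a, bag_b):
    """Distance between two bags, taken here as the closest instance pair.

    This minimum aggregation is an assumption for illustration; the paper
    may define the bag-level comparison differently.
    """
    return min(instance_distance(a, b) for a in bag_a for b in bag_b)


def bag_triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge triplet loss on bags.

    The loss is zero once the anchor bag is closer to the positive bag
    (e.g. a temporally close, consistent keyframe) than to the negative
    bag (a distant or inconsistent keyframe) by at least `margin`.
    """
    d_pos = bag_distance(anchor, positive)
    d_neg = bag_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Toy bags: each inner list is a 2-D feature of one person detection.
anchor = [[0.0, 0.0], [5.0, 5.0]]
positive = [[0.0, 1.0], [9.0, 9.0]]
negative = [[3.0, 4.0], [8.0, 0.0]]

loss = bag_triplet_loss(anchor, positive, negative, margin=1.0)
print(loss)
```

In this toy setup the anchor's nearest positive instance lies closer than its nearest negative instance by more than the margin, so the hinge is inactive; swapping the positive and negative bags yields a positive loss.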

Related Material


@InProceedings{Biswas_2021_ICCV,
    author    = {Biswas, Sovan and Gall, J\"urgen},
    title     = {Multiple Instance Triplet Loss for Weakly Supervised Multi-Label Action Localisation of Interacting Persons},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2021},
    pages     = {2159-2167}
}