A Single-Stage, Bottom-Up Approach for Occluded VIS Using Spatio-Temporal Embeddings

Ali Athar, Sabarinath Mahadevan, Aljoša Ošep, Laura Leal-Taixé, Bastian Leibe; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2021, pp. 3858-3862

Abstract


The task of Video Instance Segmentation (VIS) involves segmenting, tracking and classifying all object instances present in a given video clip. Occluded VIS is a more challenging extension of this task which involves longer video sequences where objects undergo significant occlusions over time. Most existing approaches to VIS involve multiple networks which separately handle segmenting, tracking and classifying object instances, and potentially a set of heuristics to combine the individual network outputs. By contrast, we employ just one, single-stage network without any heuristics or post-processing for the end-to-end task. Our approach is called 'STEm-Seg', which is a bottom-up method for Segmenting object instances in videos using Spatio-Temporal Embeddings. We achieve 3rd place in the Occluded VIS challenge with an mAP score of 21.6% on the test set.

Related Material


@InProceedings{Athar_2021_ICCV,
    author    = {Athar, Ali and Mahadevan, Sabarinath and O\v{s}ep, Aljo\v{s}a and Leal-Taix\'e, Laura and Leibe, Bastian},
    title     = {A Single-Stage, Bottom-Up Approach for Occluded VIS Using Spatio-Temporal Embeddings},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2021},
    pages     = {3858-3862}
}