From VIS to OVIS: A Technical Report To Promote the Development of the Field
Occluded video instance segmentation (OVIS) is a new vision task that has emerged in recent years and is tackled with video deep learning algorithms. It takes continuous video frames as input, generally ranging from a few frames to hundreds of frames. OVIS was preceded by a closely related task, video instance segmentation (VIS). To tackle both OVIS and VIS, we design a new algorithm called SimVTR, which is based on DETR and VisTR. SimVTR trades off computing resources against the effectiveness of end-to-end video instance segmentation: we ran our experiments on a single RTX 1080 Ti (11 GB), and the batch size can vary from 1 to 16 frames. Although we achieve 27.66 mAP on the OVIS test set, 25.18 mAP on the OVIS validation set, and 31.9 mAP on the VIS test set, we observe a surprising phenomenon: the evaluation mechanism is not sensitive to our method. When we use only a single frame for inference, the model obtains nearly the same mAP as with dozens of frames. We believe the VIS and OVIS cocoapi, in particular ytvoseval.py, contains some questionable logic. In this technical report, we prudently point out that the evaluation mechanism may contain a bug. If this is true, we need to re-examine our models to promote the progress of video instance segmentation.
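The single-frame probe described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name and the toy masks are hypothetical, and it only shows the idea of replicating one frame's prediction across a whole clip to test whether a video-level metric rewards temporal information.

```python
# Hypothetical sketch: replicate a single-frame prediction across an
# entire clip before evaluation. If a video evaluator scores this
# nearly as high as genuine per-frame predictions, the metric is not
# sensitive to temporal consistency.
import numpy as np

def replicate_first_frame(pred_masks: np.ndarray) -> np.ndarray:
    """Given per-frame instance masks of shape (T, H, W), return a
    clip in which every frame repeats frame 0."""
    return np.repeat(pred_masks[:1], pred_masks.shape[0], axis=0)

# Toy example: a 3-frame clip of 2x2 binary masks for one instance.
clip = np.array([
    [[1, 0], [0, 0]],
    [[0, 1], [0, 0]],
    [[0, 0], [1, 0]],
], dtype=np.uint8)

single = replicate_first_frame(clip)
# `single` now contains frame 0 repeated three times; submitting it
# in place of `clip` probes the evaluator's temporal sensitivity.
```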