Learning Video Object Segmentation With Visual Memory

Pavel Tokmakov, Karteek Alahari, Cordelia Schmid; The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 4481-4490


This paper addresses the task of segmenting moving objects in unconstrained videos. We introduce a novel two-stream neural network with an explicit memory module to achieve this. The two streams of the network encode spatial and temporal features in a video sequence respectively, while the memory module captures the evolution of objects over time. The module to build a 'visual memory' in video, i.e., a joint representation of all the video frames, is realized with a convolutional recurrent unit learned from a small number of training video sequences. Given a video frame as input, our approach assigns each pixel an object or background label based on the learned spatio-temporal features as well as the 'visual memory' specific to the video, acquired automatically without any manually-annotated frames. We evaluate our method extensively on two benchmarks, DAVIS and Freiburg-Berkeley motion segmentation datasets, and show state-of-the-art results. For example, our approach outperforms the top method on the DAVIS dataset by nearly 6%. We also provide an extensive ablative analysis to investigate the influence of each component in the proposed framework.

Related Material

[pdf] [arXiv] [video]
author = {Tokmakov, Pavel and Alahari, Karteek and Schmid, Cordelia},
title = {Learning Video Object Segmentation With Visual Memory},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}