Unsupervised Semantic Parsing of Video Collections

Ozan Sener, Amir R. Zamir, Silvio Savarese, Ashutosh Saxena; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 4480-4488


Human communication typically has an underlying structure. This is reflected in the fact that in many user-generated videos, a starting point, an ending, and certain objective steps between the two can be identified. In this paper, we propose a method for parsing a video into such semantic steps in an unsupervised way. The proposed method is capable of providing a semantic "storyline" of the video composed of its objective steps. We accomplish this by utilizing both visual and language cues in a joint generative model. The proposed method can also provide a textual description for each of the identified semantic steps and video segments. We evaluate this method on a large number of complex YouTube videos and show results of unprecedented quality for this new and impactful problem.

Related Material

author = {Sener, Ozan and Zamir, Amir R. and Savarese, Silvio and Saxena, Ashutosh},
title = {Unsupervised Semantic Parsing of Video Collections},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
month = {December},
year = {2015}