Video Event Understanding Using Natural Language Descriptions

Vignesh Ramanathan, Percy Liang, Li Fei-Fei; Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2013, pp. 905-912


Human action and role recognition play an important part in complex event understanding. State-of-the-art methods learn action and role models from detailed spatio-temporal annotations, which require extensive human effort. In this work, we propose a method to learn such models based on natural language descriptions of the training videos, which are easier to collect and scale with the number of actions and roles. There are two challenges with using this form of weak supervision: First, these descriptions only provide a high-level summary and often do not directly mention the actions and roles occurring in a video. Second, natural language descriptions do not provide spatio-temporal annotations of actions and roles. To tackle these challenges, we introduce a topic-based semantic relatedness (SR) measure between a video description and an action and role label, and incorporate it into a posterior regularization objective. Our event recognition system based on these action and role models matches the state-of-the-art method on the TRECVID-MED11 event kit, despite weaker supervision.
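The core idea of the SR measure is to compare a video description and an action/role label in a shared topic space rather than by direct word overlap. The sketch below illustrates that idea with a toy keyword-based "topic model" and cosine similarity; the topic sets, label strings, and function names are hypothetical stand-ins, not the paper's actual latent topic model or objective.

```python
import math

def topic_distribution(text, topics):
    """Toy topic inference: normalized counts of topic-keyword hits.
    `topics` maps topic name -> set of keywords (assumed for illustration)."""
    words = text.lower().split()
    counts = {t: sum(w in kws for w in words) for t, kws in topics.items()}
    total = sum(counts.values()) or 1
    return {t: c / total for t, c in counts.items()}

def semantic_relatedness(description, label, topics):
    """Cosine similarity between the topic distributions of a video
    description and an action/role label."""
    p = topic_distribution(description, topics)
    q = topic_distribution(label, topics)
    dot = sum(p[t] * q[t] for t in topics)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

# Hypothetical topics loosely inspired by TRECVID-MED11 event categories.
TOPICS = {
    "board_trick": {"skateboard", "jump", "ramp", "trick", "ollie"},
    "wedding": {"bride", "groom", "dance", "ceremony", "vows"},
}

desc = "a man does an ollie and a jump on his skateboard near a ramp"
print(semantic_relatedness(desc, "skateboard trick jump", TOPICS))  # → 1.0
print(semantic_relatedness(desc, "bride groom dance", TOPICS))      # → 0.0
```

Even though the description never states the label verbatim, both map to the same topic, so the measure scores them as related; this is what lets a high-level summary weakly supervise action and role models.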

Related Material

@InProceedings{Ramanathan_2013_ICCV,
  author    = {Ramanathan, Vignesh and Liang, Percy and Fei-Fei, Li},
  title     = {Video Event Understanding Using Natural Language Descriptions},
  booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV)},
  month     = {December},
  year      = {2013}
}