Jointly Recognizing Object Fluents and Tasks in Egocentric Videos

Yang Liu, Ping Wei, Song-Chun Zhu; The IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2924-2932

Abstract


This paper addresses the problem of jointly recognizing object fluents and tasks in egocentric videos. Fluents are the changeable attributes of objects. Tasks are goal-oriented human activities that involve interacting with objects and aim to change some attributes of those objects. Executing a task is therefore a process of changing object fluents over time. We propose a hierarchical model that represents tasks as concurrent and sequential object fluents. Within a task, different fluents closely interact with each other in both the spatial and temporal domains. Given an egocentric video, a beam search algorithm is applied to jointly recognize the object fluents in each frame and the task of the entire video. We collected a large-scale egocentric video dataset of tasks and fluents. This dataset contains 14 categories of tasks, 25 object classes, 21 categories of object fluents, 809 video sequences, and approximately 333,000 video frames. The experimental results on this dataset demonstrate the strength of our method.
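To make the joint inference idea concrete, the following is a minimal sketch of beam search over per-frame fluent labels, run once per task hypothesis. It is not the authors' model: the abstract does not give the scoring functions, so the per-frame log scores, the task-specific transition tables, the fallback penalty of -5.0, the single-fluent-per-frame simplification, and all names (`beam_search`, `recognize`, `frame_scores`, `task_transitions`) are assumptions made purely for illustration.

```python
import math


def beam_search(frame_scores, transition, beam_width=3):
    """Approximate the best label sequence with beam search.

    frame_scores: list of dicts {fluent_label: log_score}, one per frame.
    transition:   function (prev_label, label) -> log score.
    Returns a tuple (label_sequence, total_log_score).
    """
    beams = [((), 0.0)]  # each beam is (partial sequence, accumulated score)
    for scores in frame_scores:
        candidates = []
        for seq, total in beams:
            for label, emit in scores.items():
                # No transition term for the first frame.
                trans = transition(seq[-1], label) if seq else 0.0
                candidates.append((seq + (label,), total + emit + trans))
        # Keep only the highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def recognize(frame_scores, task_transitions, beam_width=3, penalty=-5.0):
    """Pick the task whose transition table best explains the frames.

    task_transitions: dict {task: {(prev_label, label): log_score}}.
    Returns (task, fluent_sequence, total_log_score).
    """
    best = None
    for task, table in task_transitions.items():
        def trans(a, b, table=table):
            # Unlisted transitions get a fixed (assumed) penalty.
            return table.get((a, b), penalty)

        seq, score = beam_search(frame_scores, trans, beam_width)
        if best is None or score > best[2]:
            best = (task, list(seq), score)
    return best


# Toy example: a door whose fluent goes from "closed" to "open".
frame_scores = [
    {"closed": math.log(0.9), "open": math.log(0.1)},
    {"closed": math.log(0.5), "open": math.log(0.5)},
    {"closed": math.log(0.2), "open": math.log(0.8)},
]
task_transitions = {
    "open_door": {
        ("closed", "closed"): math.log(0.5),
        ("closed", "open"): math.log(0.5),
        ("open", "open"): math.log(0.9),
        ("open", "closed"): math.log(0.1),
    },
    "close_door": {
        ("open", "open"): math.log(0.5),
        ("open", "closed"): math.log(0.5),
        ("closed", "closed"): math.log(0.9),
        ("closed", "open"): math.log(0.1),
    },
}
result = recognize(frame_scores, task_transitions, beam_width=2)
print(result[0], result[1])
```

In this toy setup the task label and the frame-level fluent sequence are chosen together, since each task hypothesis rescores the same frames with its own transition table. A fuller treatment of concurrent fluents across several objects would need a joint state per frame, which this sketch deliberately leaves out.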

Related Material


[bibtex]
@InProceedings{Liu_2017_ICCV,
author = {Liu, Yang and Wei, Ping and Zhu, Song-Chun},
title = {Jointly Recognizing Object Fluents and Tasks in Egocentric Videos},
booktitle = {The IEEE International Conference on Computer Vision (ICCV)},
month = {Oct},
year = {2017}
}