Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks

Ping Wei, Yang Liu, Tianmin Shu, Nanning Zheng, Song-Chun Zhu; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 6801-6809

Abstract

This paper addresses a new problem: jointly inferring human attention, intentions, and tasks from videos. Given an RGB-D video in which a human performs a task, we answer three questions simultaneously: 1) where the human is looking (attention prediction); 2) why the human is looking there (intention prediction); and 3) what task the human is performing (task recognition). We propose a hierarchical human-attention-object (HAO) model that represents tasks, intentions, and attention in a unified framework. A task is represented as a sequence of intentions that transition into one another; each intention is composed of the human pose, attention, and objects. A beam search algorithm performs inference on the HAO graph to jointly output the attention, intention, and task results. We also built a new video dataset of tasks, intentions, and attention, containing 14 task classes, 70 intention categories, 28 object classes, 809 videos, and approximately 330,000 frames. Experiments show that our approach outperforms existing methods.
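The abstract mentions beam search inference over sequential intention hypotheses on the HAO graph. As a rough illustration of that style of inference (not the authors' actual model), the sketch below runs a generic beam search over per-frame candidate intention labels; the scoring functions, label names, and toy evidence values are all hypothetical stand-ins.

```python
# Hedged sketch: generic beam search over per-frame intention hypotheses,
# in the spirit of inference on a sequential model such as the HAO graph.
# The unary/transition scores and labels below are illustrative only.

def beam_search(frames, candidates, unary, transition, beam_width=3):
    """Return the highest-scoring (label sequence, score) over `frames`.

    frames     -- iterable of per-frame observations
    candidates -- possible labels (e.g. intention classes)
    unary      -- unary(frame, label): score of a label at one frame
    transition -- transition(prev, label): score of prev -> label
    """
    beams = [([], 0.0)]  # partial (label sequence, cumulative score)
    for frame in frames:
        expanded = []
        for seq, score in beams:
            for label in candidates:
                s = score + unary(frame, label)
                if seq:
                    s += transition(seq[-1], label)
                expanded.append((seq + [label], s))
        # keep only the top-scoring partial sequences
        expanded.sort(key=lambda x: x[1], reverse=True)
        beams = expanded[:beam_width]
    return beams[0]

# Toy usage: two hypothetical intention classes; each frame carries
# noisy evidence (here a scalar in [0, 1]) for the class "reach".
frames = [0.9, 0.8, 0.2, 0.1]
unary = lambda f, l: f if l == "reach" else 1.0 - f
transition = lambda p, l: 0.3 if p == l else 0.0  # favor smooth sequences
best_seq, best_score = beam_search(frames, ["reach", "drink"], unary, transition)
print(best_seq)  # → ['reach', 'reach', 'drink', 'drink']
```

The transition bonus keeps the decoded intention sequence temporally smooth, so a single noisy frame does not flip the label; widening the beam trades compute for a lower chance of pruning the true sequence.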

Related Material

[bibtex]
@InProceedings{Wei_2018_CVPR,
author = {Wei, Ping and Liu, Yang and Shu, Tianmin and Zheng, Nanning and Zhu, Song-Chun},
title = {Where and Why Are They Looking? Jointly Inferring Human Attention and Intentions in Complex Tasks},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2018}
}