ESCNet: Gaze Target Detection With the Understanding of 3D Scenes

Jun Bao, Buyu Liu, Jun Yu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 14126-14135

Abstract


This paper addresses the single-image gaze target detection problem. Conventional methods either focus on 2D visual cues or exploit additional depth information in a very coarse manner. In this work, we propose to explicitly and effectively model 3D geometry under a challenging scenario where only 2D annotations are available. We first obtain the 3D point cloud of a given scene from estimated depth and reference objects. We then identify the front-most points in all possible 3D directions from the given person. These points are later leveraged in our ESCNet model. Specifically, ESCNet consists of a geometry module and a scene parsing module. The former produces an initial heatmap that infers the probability of each front-most point being looked at, according to the estimated 3D gaze direction. The latter further explores scene contextual cues to regulate the detection results. We validate our idea on two publicly available datasets, GazeFollow and VideoAttentionTarget, and demonstrate state-of-the-art performance. Our method also surpasses human performance in terms of AUC on GazeFollow.
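To make the described pipeline concrete, the following is a minimal, hypothetical sketch (not the authors' released code) of the two geometric steps the abstract mentions: back-projecting an estimated depth map into a 3D point cloud, and scoring each scene point by how well the ray from the person's head aligns with an estimated 3D gaze direction to form an initial heatmap. Function names, camera intrinsics, and the cosine-based scoring rule are assumptions for illustration; the sketch also omits the front-most point selection and the scene parsing module.

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project an HxW depth map into camera-frame 3D points of shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

def gaze_cone_heatmap(points, head_pos, gaze_dir, sharpness=20.0):
    """Assumed scoring rule: rate each 3D point by the angle between the ray
    (point - head_pos) and the estimated gaze direction; peaks along the gaze ray."""
    rays = points - head_pos
    rays = rays / (np.linalg.norm(rays, axis=1, keepdims=True) + 1e-8)
    gaze_dir = gaze_dir / (np.linalg.norm(gaze_dir) + 1e-8)
    cos_sim = rays @ gaze_dir                      # alignment with gaze direction in [-1, 1]
    return np.exp(sharpness * (cos_sim - 1.0))     # soft cone around the gaze ray

# Example usage with synthetic inputs (an estimated depth map in practice)
depth = np.random.uniform(1.0, 5.0, size=(240, 320))
pts = depth_to_point_cloud(depth, fx=300.0, fy=300.0, cx=160.0, cy=120.0)
heat = gaze_cone_heatmap(pts,
                         head_pos=np.array([0.0, 0.0, 1.5]),
                         gaze_dir=np.array([0.2, 0.1, 1.0])).reshape(240, 320)
```

In the paper's setting, such an initial heatmap would then be restricted to the front-most points per direction and combined with scene contextual cues before producing the final detection.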

Related Material


[bibtex]
@InProceedings{Bao_2022_CVPR,
    author    = {Bao, Jun and Liu, Buyu and Yu, Jun},
    title     = {ESCNet: Gaze Target Detection With the Understanding of 3D Scenes},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {14126-14135}
}