Learning Spatiotemporal Attention for Egocentric Action Recognition

Minlong Lu, Danping Liao, Ze-Nian Li; The IEEE International Conference on Computer Vision (ICCV), 2019


Recognizing a camera wearer's actions from videos captured by a head-mounted camera is a challenging task. Previous methods often use attention models to identify the spatial regions relevant to egocentric action recognition. Inspired by recent advances in spatiotemporal feature learning with 3D convolutions, we propose a simple yet efficient module for learning spatiotemporal attention in egocentric videos with human gaze as supervision. Our model employs a two-stream architecture that consists of an appearance-based stream and a motion-based stream. Each stream contains a spatiotemporal attention module (STAM) that produces an attention map, which helps the model focus on the spatiotemporal regions of the video that are relevant for action recognition. Experimental results demonstrate that our model outperforms state-of-the-art methods by a large margin on the standard EGTEA Gaze+ dataset and produces attention maps that are consistent with human gaze.
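
The idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not say how the attention map is normalized, which loss ties it to gaze, or how the two streams are fused. The sketch assumes a softmax over all time-space positions, a KL-divergence gaze loss, and late fusion by averaging class scores. All function names, weight shapes, and parameters are hypothetical.

```python
import numpy as np

def attention_map(features, weights):
    """Produce one attention map over time and space.

    features: (C, T, H, W) feature volume from a 3D-convolutional stream.
    weights:  (C,) hypothetical 1x1x1 convolution that scores each position.
    Returns a (T, H, W) map that sums to 1 (softmax is an assumption).
    """
    logits = np.tensordot(weights, features, axes=([0], [0]))  # (T, H, W)
    logits = logits - logits.max()  # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()

def attend(features, attn):
    """Attention-weighted pooling of features over T, H, W -> (C,)."""
    return (features * attn[None]).sum(axis=(1, 2, 3))

def gaze_loss(attn, gaze, eps=1e-8):
    """KL divergence from a gaze heatmap to the attention map (assumed loss)."""
    gaze = gaze / gaze.sum()
    return float(np.sum(gaze * (np.log(gaze + eps) - np.log(attn + eps))))

def two_stream_scores(rgb_feat, flow_feat, params):
    """Average the class scores of the appearance and motion streams.

    params: one (attention_weights, classifier_weights) pair per stream.
    Averaging the scores is an assumed fusion rule.
    """
    scores = []
    for feat, (w_att, w_cls) in zip((rgb_feat, flow_feat), params):
        attn = attention_map(feat, w_att)   # where this stream looks
        pooled = attend(feat, attn)         # (C,) attended descriptor
        scores.append(w_cls @ pooled)       # (num_classes,)
    return np.mean(scores, axis=0)
```

In such a setup, the gaze loss would be added to the classification loss during training so that each stream's attention map is pulled toward the recorded human gaze, while at test time only the video is needed.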

Related Material

author = {Lu, Minlong and Liao, Danping and Li, Ze-Nian},
title = {Learning Spatiotemporal Attention for Egocentric Action Recognition},
booktitle = {The IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2019}