Temporal Convolutional Networks for Action Segmentation and Detection

Colin Lea, Michael D. Flynn, Rene Vidal, Austin Reiter, Gregory D. Hager; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 156-165

Abstract


The ability to identify and temporally segment fine-grained human actions throughout a video is crucial for robotics, surveillance, education, and beyond. Typical approaches decouple this problem by first extracting local spatiotemporal features from video frames and then feeding them into a temporal classifier that captures high-level temporal patterns. We describe a class of temporal models, which we call Temporal Convolutional Networks (TCNs), that use a hierarchy of temporal convolutions to perform fine-grained action segmentation or detection. Our Encoder-Decoder TCN uses pooling and upsampling to efficiently capture long-range temporal patterns, whereas our Dilated TCN uses dilated convolutions. We show that TCNs are capable of capturing action compositions, segment durations, and long-range dependencies, and are over an order of magnitude faster to train than competing LSTM-based Recurrent Neural Networks. We apply these models to three challenging fine-grained datasets and show large improvements over the state of the art.
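
To make the two model families concrete, below is a minimal PyTorch sketch of an Encoder-Decoder TCN (temporal convolutions with pooling and upsampling) and a Dilated TCN (a stack of temporal convolutions with exponentially increasing dilation). The layer counts, channel widths, kernel sizes, and plain ReLU/MaxPool/Upsample choices are illustrative assumptions, not the published configurations, which differ in details such as activation and normalization.

import torch
import torch.nn as nn


class EncoderDecoderTCN(nn.Module):
    # Encoder-Decoder TCN sketch: temporal convolutions with pooling in the
    # encoder and upsampling in the decoder, then a per-frame classifier.
    def __init__(self, in_channels, num_classes, hidden=64, kernel_size=25):
        super().__init__()
        pad = kernel_size // 2
        self.encoder = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size, padding=pad), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(hidden, hidden, kernel_size, padding=pad), nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden, hidden, kernel_size, padding=pad), nn.ReLU(),
            nn.Upsample(scale_factor=2),
            nn.Conv1d(hidden, hidden, kernel_size, padding=pad), nn.ReLU(),
        )
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, feature_dim, time), with time divisible by 4 here
        return self.classifier(self.decoder(self.encoder(x)))


class DilatedTCN(nn.Module):
    # Dilated TCN sketch: 1D convolutions with dilation doubling per layer,
    # growing the temporal receptive field without any pooling.
    def __init__(self, in_channels, num_classes, hidden=64, num_layers=4):
        super().__init__()
        layers, channels = [], in_channels
        for i in range(num_layers):
            dilation = 2 ** i
            layers += [nn.Conv1d(channels, hidden, kernel_size=3,
                                 padding=dilation, dilation=dilation),
                       nn.ReLU()]
            channels = hidden
        self.tcn = nn.Sequential(*layers)
        self.classifier = nn.Conv1d(hidden, num_classes, kernel_size=1)

    def forward(self, x):
        # x: (batch, feature_dim, time)
        return self.classifier(self.tcn(x))


# Usage: per-frame class scores for a 2048-d feature sequence of 128 frames.
if __name__ == "__main__":
    frames = torch.randn(1, 2048, 128)
    print(EncoderDecoderTCN(2048, 10)(frames).shape)  # torch.Size([1, 10, 128])
    print(DilatedTCN(2048, 10)(frames).shape)         # torch.Size([1, 10, 128])

Both variants map a sequence of per-frame features to per-frame class scores; the encoder-decoder version trades temporal resolution inside the network for efficiency, while the dilated version keeps full resolution throughout.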

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Lea_2017_CVPR,
author = {Lea, Colin and Flynn, Michael D. and Vidal, Rene and Reiter, Austin and Hager, Gregory D.},
title = {Temporal Convolutional Networks for Action Segmentation and Detection},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {July},
year = {2017}
}