MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation

Yazan Abu Farha, Jurgen Gall; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 3575-3584

Abstract


Temporally locating and classifying action segments in long untrimmed videos is of particular interest to many applications like surveillance and robotics. While traditional approaches follow a two-step pipeline that first generates frame-wise probabilities and then feeds them to high-level temporal models, recent approaches use temporal convolutions to directly classify the video frames. In this paper, we introduce a multi-stage architecture for the temporal action segmentation task. Each stage features a set of dilated temporal convolutions to generate an initial prediction that is refined by the next one. This architecture is trained using a combination of a classification loss and a proposed smoothing loss that penalizes over-segmentation errors. Extensive evaluation shows the effectiveness of the proposed model in capturing long-range dependencies and recognizing action segments. Our model achieves state-of-the-art results on three challenging datasets: 50Salads, Georgia Tech Egocentric Activities (GTEA), and the Breakfast dataset.
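To make the abstract's two ingredients concrete, below is a minimal NumPy sketch (not the authors' implementation) of (a) one dilated temporal convolution layer with a residual connection, the building block each stage stacks, and (b) a truncated mean-squared smoothing penalty on adjacent-frame log-probabilities, one plausible form of the over-segmentation loss described. The kernel size (3), the ReLU placement, and the truncation threshold `tau` are assumptions for illustration; the paper's trained model also uses learned 1x1 convolutions and cross-entropy as the classification loss.

```python
import numpy as np

def dilated_residual_layer(x, w, b, dilation):
    """One dilated temporal conv (kernel size 3) + ReLU + residual.

    x: (C, T) frame features; w: (C, C, 3) kernel; b: (C,) bias.
    Padding by `dilation` on both sides keeps the output length T,
    so predictions stay frame-aligned across stages.
    """
    C, T = x.shape
    xp = np.pad(x, ((0, 0), (dilation, dilation)))
    out = np.zeros_like(x)
    for t in range(T):
        # taps at t - dilation, t, t + dilation in the original signal
        window = np.stack(
            [xp[:, t], xp[:, t + dilation], xp[:, t + 2 * dilation]], axis=-1
        )  # (C, 3)
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return x + np.maximum(out, 0.0)  # residual connection around the ReLU'd conv

def truncated_smoothing_loss(log_probs, tau=4.0):
    """Truncated MSE over adjacent-frame log-probability differences.

    log_probs: (num_classes, T) frame-wise log-probabilities.
    Large jumps are clipped at `tau` (an assumed constant), so the
    penalty discourages spurious segment boundaries without washing
    out genuine action transitions.
    """
    delta = np.abs(log_probs[:, 1:] - log_probs[:, :-1])
    return np.mean(np.minimum(delta, tau) ** 2)
```

A full stage would stack such layers with exponentially increasing dilation (1, 2, 4, ...), and each later stage would take the previous stage's frame-wise probabilities as input to refine them.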

Related Material


[pdf]
[bibtex]
@InProceedings{Farha_2019_CVPR,
author = {Farha, Yazan Abu and Gall, Jurgen},
title = {MS-TCN: Multi-Stage Temporal Convolutional Network for Action Segmentation},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}