Long-Term Action Forecasting Using Multi-Headed Attention-Based Variational Recurrent Neural Networks

Siyuan Brandon Loh, Debaditya Roy, Basura Fernando; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 2419-2427

Abstract


Systems that predict both an action and the time someone might take to perform it need to account for the inherent uncertainty in what humans do. Here, we present a novel hybrid generative model for action anticipation that attempts to capture this uncertainty. Our model uses a multi-headed attention-based variational generative model for action prediction (MAVAP) and Gaussian log-likelihood maximization to predict the corresponding action's duration. During training, we optimize three losses: a variational loss, a negative log-likelihood loss, and a discriminative cross-entropy loss. We evaluate our model on two standard action forecasting datasets, Breakfast and 50Salads, and demonstrate improvements over prior methods using both ground truth observations and predicted features from an action segmentation network (MS-TCN++). We also show that factorizing the latent space across multiple Gaussian heads yields more plausible future action sequences than a single Gaussian.
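The three training losses named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract gives no exact formulation, so the following are all assumptions made for illustration: a standard normal prior in the KL term, a KL summed over heads and latent dimensions, an unweighted sum of the three terms (the hypothetical weights w_kl, w_nll, w_ce default to 1), and every function and parameter name.

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(exp(log_var))) || N(0, I)), summed over all entries.

    Assumption: a standard normal prior; mu and log_var may have shape
    (heads, latent_dim) to mimic several Gaussian heads.
    """
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

def gaussian_nll(target, mean, log_var):
    """Negative log-likelihood of a scalar duration under N(mean, exp(log_var))."""
    return 0.5 * (np.log(2.0 * np.pi) + log_var
                  + (target - mean) ** 2 / np.exp(log_var))

def cross_entropy(logits, label):
    """Cross-entropy of one action label, using a numerically stable log-softmax."""
    shifted = logits - np.max(logits)
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def total_loss(mu, log_var, dur_target, dur_mean, dur_log_var,
               logits, label, w_kl=1.0, w_nll=1.0, w_ce=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_kl * kl_to_standard_normal(mu, log_var)
            + w_nll * gaussian_nll(dur_target, dur_mean, dur_log_var)
            + w_ce * cross_entropy(logits, label))

# Toy example: 2 heads with 3 latent dimensions each, 4 action classes.
mu = np.zeros((2, 3))
log_var = np.zeros((2, 3))
logits = np.zeros(4)
loss = total_loss(mu, log_var, dur_target=0.0, dur_mean=0.0,
                  dur_log_var=0.0, logits=logits, label=2)
```

In this toy setting the KL term is zero because the posterior equals the assumed prior, so the total reduces to the duration term plus the classification term, 0.5 * log(2 * pi) + log(4).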

Related Material


[bibtex]
@InProceedings{Loh_2022_CVPR,
    author    = {Loh, Siyuan Brandon and Roy, Debaditya and Fernando, Basura},
    title     = {Long-Term Action Forecasting Using Multi-Headed Attention-Based Variational Recurrent Neural Networks},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {2419-2427}
}