HR-STAN: High-Resolution Spatio-Temporal Attention Network for 3D Human Motion Prediction
3D human motion prediction requires making sense of the complex spatio-temporal dynamics that underpin human motion in order to make highly accurate predictions. Part of this complexity stems from the trade-off between long-term (>400ms) and short-term (<400ms) predictions, which require different levels of granularity to observe patterns. Several works have improved long-term prediction performance by utilizing longer motion histories, but this typically comes at the cost of very short-term (<200ms) performance. Inspired by high-resolution network architectures, we propose a novel high-resolution spatio-temporal attention network (HR-STAN) that leverages parallel feature branches and dilated convolutions to observe human motion at different scales. Furthermore, we augment this architecture with split spatial and temporal attention mechanisms to efficiently capture spatio-temporal dependencies within a given motion. We evaluate the ability of HR-STAN to incorporate long-term motion histories while producing short-term predictions and show that it improves over several state-of-the-art methods on both the AMASS and Human3.6M benchmarks.
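To make the split spatial and temporal attention idea concrete, the following is a minimal NumPy sketch, not the authors' implementation: a motion sequence is treated as a tensor of shape (frames, joints, channels), temporal attention lets each joint attend across frames, spatial attention lets each frame attend across joints, and the two results are summed. All function names and the example shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for attention weights.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Scaled dot-product attention over the second-to-last axis.
    scores = q @ k.swapaxes(-1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def split_st_attention(x):
    # x: (T, J, C) motion sequence with T frames, J joints, C channels.
    # Temporal branch: each joint attends over the T frames.
    xt = x.transpose(1, 0, 2)                      # (J, T, C)
    xt = attention(xt, xt, xt).transpose(1, 0, 2)  # back to (T, J, C)
    # Spatial branch: each frame attends over the J joints.
    xs = attention(x, x, x)                        # (T, J, C)
    return xt + xs

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 22, 16))  # 10 frames, 22 joints, 16 channels
y = split_st_attention(x)
print(y.shape)  # (10, 22, 16)
```

Factorizing attention this way attends over T frames and J joints separately instead of over all T*J tokens jointly, which is the efficiency motivation behind splitting the spatial and temporal mechanisms.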