TRM: Temporal Relocation Module for Video Recognition
One of the key differences between video and image understanding lies in how temporal information is modeled. Because convolution kernels are limited in size, most previous methods model long-term temporal information by sequentially stacking convolution layers. This conventional approach does not explicitly differentiate between regions or pixels that require different temporal receptive fields, and it may suffer from temporal information distortion. In this paper, we propose a novel Temporal Relocation Module (TRM), which adaptively captures long-term temporal dependence in a spatially aware manner. Specifically, TRM relocates spatial features along the temporal dimension, so that each pixel is given its own adaptive temporal receptive field. Because the relocation is performed over the full temporal extent of the input video, TRM can potentially model long-term temporal information with a receptive field equivalent to the entire video. Experimental results on three representative video recognition benchmarks show that TRM noticeably outperforms previous state-of-the-art methods, verifying the effectiveness of our method.
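To make the idea of relocating spatial features along the temporal dimension concrete, the following is a minimal NumPy sketch of one possible reading, not the module described in this paper. It assumes each spatial location is given an integer temporal offset and reads its features from the frame at that offset, clamped to the video's temporal extent. The function name `relocate_along_time`, the `(T, C, H, W)` layout, and the integer offsets are all assumptions made for illustration; the abstract does not specify how offsets are produced or whether they are learned or fractional.

```python
import numpy as np


def relocate_along_time(features, offsets):
    """Gather per-pixel features from temporally shifted frames.

    features: array of shape (T, C, H, W), one feature map per frame.
    offsets:  integer array of shape (T, H, W); offsets[t, h, w] is the
              (hypothetical) temporal shift applied at pixel (h, w) of frame t.
    Returns an array of shape (T, C, H, W).
    """
    T, C, H, W = features.shape
    # Frame index of each output position before shifting.
    t = np.arange(T).reshape(T, 1, 1)
    # Source frame per pixel, clamped so it stays inside the video.
    src = np.clip(t + offsets, 0, T - 1)
    # Share the same source frame across all channels of a pixel.
    idx = np.broadcast_to(src[:, None, :, :], (T, C, H, W))
    # out[t, c, h, w] = features[idx[t, c, h, w], c, h, w]
    return np.take_along_axis(features, idx, axis=0)


# Tiny example: 4 frames, 1 channel, a 1x2 spatial grid.
T, C, H, W = 4, 1, 1, 2
features = np.zeros((T, C, H, W))
for frame in range(T):
    features[frame, 0, 0, :] = [frame * 10, frame * 10 + 1]

offsets = np.zeros((T, H, W), dtype=int)
offsets[:, 0, 1] = 3  # the second pixel looks 3 frames ahead
relocated = relocate_along_time(features, offsets)
```

In this sketch the first pixel keeps a zero offset and is unchanged, while the second pixel reads from up to three frames ahead, so the two locations effectively see different temporal ranges within the same layer. Because the source index is only clamped to the range of the video, a single relocation can reach any frame.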