Learning Multi-Scale Representations With Single-Stream Network for Video Retrieval
With the explosive growth of video content on the Internet, video retrieval has become an important problem that benefits video recommendation and copyright detection. Since the key features of a video may be distributed across distant regions of a lengthy video, several works have succeeded by exploiting multi-stream, multi-scale architectures to learn and merge distant features. However, a multi-stream network is costly in terms of memory and computational overhead. Both the number of scales and the scales themselves are handcrafted and fixed once a model is finalized. Further, being more complicated, multi-stream networks are more prone to overfitting and generalize more poorly. This paper proposes a single-stream network with built-in dilated spatial and temporal learning capability. Combined with modern techniques, including a Denoising Autoencoder, Squeeze-and-Excitation Attention, and a Triplet Comparative Mechanism, our model achieves state-of-the-art performance on several video retrieval tasks on the FIVR-200K, CC_WEB_VIDEO, and EVVE datasets.
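Two of the named building blocks can be illustrated compactly. The NumPy sketch below shows (a) a dilated 1-D convolution along the time axis, which enlarges the temporal receptive field without extra streams, and (b) Squeeze-and-Excitation-style channel reweighting. All function names, shapes, and the use of random projection weights in place of learned ones are our own illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def dilated_temporal_conv(x, w, dilation):
    """Dilated 1-D convolution along the time axis.
    x: (T, C) per-frame features; w: (K, C) per-tap channel weights.
    Zero-pads so the output keeps the same temporal length."""
    T, C = x.shape
    K = w.shape[0]
    pad = (K - 1) * dilation // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            # Taps are spaced `dilation` frames apart, widening the
            # receptive field without adding parameters or streams.
            out[t] += xp[t + k * dilation] * w[k]
    return out

def se_attention(x, reduction=4, seed=0):
    """Squeeze-and-Excitation style channel gating.
    Random projection matrices stand in for learned weights."""
    C = x.shape[1]
    z = x.mean(axis=0)                     # squeeze: global temporal pooling
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((C, C // reduction))
    w2 = rng.standard_normal((C // reduction, C))
    h = np.maximum(z @ w1, 0)              # excitation: ReLU bottleneck
    s = 1.0 / (1.0 + np.exp(-(h @ w2)))    # sigmoid gate in (0, 1)
    return x * s                           # channel-wise rescale

# Usage: 16 frames of 8-dimensional features.
x = np.random.default_rng(1).standard_normal((16, 8))
w = np.ones((3, 8)) / 3.0                  # 3-tap averaging kernel
y = dilated_temporal_conv(x, w, dilation=2)
g = se_attention(x)
```

In a single-stream design, stacking such layers with growing dilation rates lets one network cover multiple temporal scales, rather than dedicating a separate stream to each handcrafted scale.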