Deep Spatial-Temporal Fusion Network for Video-Based Person Re-Identification

Lin Chen, Hua Yang, Ji Zhu, Qin Zhou, Shuang Wu, Zhiyong Gao; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 63-70

Abstract


In this paper, we propose a novel deep end-to-end network that automatically learns spatial-temporal fusion features for video-based person re-identification. Specifically, the proposed network combines a CNN and an RNN to jointly learn the spatial and the temporal features of input image sequences. The network is optimized with siamese and softmax losses simultaneously, pulling instances of the same person closer and pushing instances of different persons apart. Our network is trained on full-body and part-body image sequences separately to learn complementary representations from holistic and local perspectives. By combining them, we obtain more discriminative features that benefit person re-identification. Experiments conducted on the PRID-2011, iLIDS-VID, and MARS datasets show that the proposed method performs favorably against existing approaches.
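As a rough illustration of the design described above, the following is a minimal PyTorch sketch of one such branch: a small per-frame CNN feeds an RNN whose outputs are averaged over time, and the resulting sequence feature is trained jointly with a contrastive (siamese) loss and a softmax cross-entropy loss. All layer sizes, the margin value, and the identity count are illustrative placeholders, not the paper's actual configuration or backbone.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTemporalNet(nn.Module):
    """One CNN+RNN branch: per-frame spatial features aggregated over time.
    Hypothetical sizes; the paper's exact architecture is not reproduced here."""
    def __init__(self, feat_dim=128, num_ids=150):
        super().__init__()
        # Small per-frame CNN (placeholder for the paper's spatial network).
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        # RNN over per-frame features captures temporal structure.
        self.rnn = nn.RNN(32, feat_dim, batch_first=True)
        # Identity classifier for the softmax-loss branch.
        self.classifier = nn.Linear(feat_dim, num_ids)

    def forward(self, clip):
        # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        x = self.cnn(clip.flatten(0, 1)).flatten(1)  # (b*t, 32) per-frame features
        out, _ = self.rnn(x.view(b, t, -1))          # (b, t, feat_dim)
        feat = out.mean(dim=1)                       # temporal average pooling
        return feat, self.classifier(feat)

def joint_loss(fa, fb, logits_a, logits_b, ids_a, ids_b, margin=2.0):
    """Siamese (contrastive) loss plus softmax cross-entropy, mirroring the
    abstract's joint objective; the margin is an assumed hyperparameter."""
    same = (ids_a == ids_b).float()
    d = F.pairwise_distance(fa, fb)
    contrastive = (same * d.pow(2) +
                   (1 - same) * F.relu(margin - d).pow(2)).mean()
    ce = F.cross_entropy(logits_a, ids_a) + F.cross_entropy(logits_b, ids_b)
    return contrastive + ce

# Example usage with random stand-in data (4 pairs of 8-frame sequences):
net = SpatialTemporalNet()
clip_a = torch.randn(4, 8, 3, 128, 64)
clip_b = torch.randn(4, 8, 3, 128, 64)
ids_a = torch.randint(0, 150, (4,))
ids_b = torch.randint(0, 150, (4,))
fa, la = net(clip_a)
fb, lb = net(clip_b)
loss = joint_loss(fa, fb, la, lb, ids_a, ids_b)

Under the abstract's fusion scheme, a full-body branch and part-body branches would each be an instance of such a module, trained separately, with their sequence features combined to form the final descriptor.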

Related Material


[bibtex]
@InProceedings{Chen_2017_CVPR_Workshops,
author = {Chen, Lin and Yang, Hua and Zhu, Ji and Zhou, Qin and Wu, Shuang and Gao, Zhiyong},
title = {Deep Spatial-Temporal Fusion Network for Video-Based Person Re-Identification},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {July},
year = {2017},
pages = {63-70}
}