Temporal Aggregation with Clip-level Attention for Video-based Person Re-identification

Mengliu Li, Han Xu, Jinjun Wang, Wenpeng Li, Yongli Sun; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3376-3384

Abstract


Video-based person re-identification (Re-ID) methods can extract richer features from short video clips than image-based ones. Existing methods usually apply simple strategies, such as average/max pooling, to obtain tracklet-level features, which has been shown to aggregate the information from all video frames poorly. In this paper, we propose a simple yet effective Temporal Aggregation with Clip-level Attention Network (TACAN) that solves the temporal aggregation problem in a hierarchical way. Specifically, a tracklet is first broken into different numbers of clips, and a two-stage temporal aggregation network then produces the tracklet-level feature representation. A novel min-max loss is introduced to learn both a clip-level attention extractor and a clip-level feature representer during training. At the testing stage, the resulting clip-level weights are used to average the clip-level features, generating a robust tracklet-level feature representation. Experimental results on four benchmark datasets, including MARS, iLIDS-VID, PRID-2011 and DukeMTMC-VideoReID, show that our TACAN achieves significant improvements over state-of-the-art approaches.
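To make the hierarchical aggregation described above concrete, the following is a minimal PyTorch sketch of clip-wise pooling followed by attention-weighted averaging of clip features. It is not the authors' released code: the module name ClipAttentionAggregator, the feature dimension, the clip length, and the small attention head are illustrative assumptions, and the min-max loss used to supervise the attention weights during training is omitted.

```python
import torch
import torch.nn as nn

class ClipAttentionAggregator(nn.Module):
    """Sketch of hierarchical temporal aggregation with clip-level attention.

    Frame features are first pooled within each clip, then a learned
    attention score per clip weights the clip features into a single
    tracklet-level representation. Layer sizes are illustrative only.
    """

    def __init__(self, feat_dim=2048, clip_len=4):
        super().__init__()
        self.clip_len = clip_len
        # Hypothetical attention head: one scalar score per clip feature.
        self.attn = nn.Sequential(
            nn.Linear(feat_dim, 256),
            nn.ReLU(inplace=True),
            nn.Linear(256, 1),
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) from a CNN backbone.
        b, t, d = frame_feats.shape
        num_clips = t // self.clip_len
        # Stage 1: break the tracklet into clips, pool frames within each clip.
        clips = frame_feats[:, : num_clips * self.clip_len].reshape(
            b, num_clips, self.clip_len, d
        )
        clip_feats = clips.mean(dim=2)                 # (b, num_clips, d)
        # Stage 2: clip-level attention weights, normalized over clips.
        scores = self.attn(clip_feats).squeeze(-1)     # (b, num_clips)
        weights = torch.softmax(scores, dim=1)
        # Weighted average of clip features -> tracklet-level feature.
        tracklet_feat = (weights.unsqueeze(-1) * clip_feats).sum(dim=1)
        return tracklet_feat, weights
```

In this sketch, the clip-level weights returned alongside the tracklet feature would be the quantities that the paper's min-max loss supervises during training, while at test time they simply reweight the clip features before averaging.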

Related Material


@InProceedings{Li_2020_WACV,
author = {Li, Mengliu and Xu, Han and Wang, Jinjun and Li, Wenpeng and Sun, Yongli},
title = {Temporal Aggregation with Clip-level Attention for Video-based Person Re-identification},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}
}