BibTeX
@InProceedings{Hill_2025_WACV,
  author    = {Hill, Cole and Yellin, Florence and Regmi, Krishna and Du, Dawei and McCloskey, Scott},
  title     = {Re-Identifying People in Video via Learned Temporal Attention and Multi-Modal Foundation Models},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {6259-6268}
}
Re-Identifying People in Video via Learned Temporal Attention and Multi-Modal Foundation Models
Abstract
Biometric recognition from security camera video is a challenging problem when individuals change clothes or are partly occluded. Others have recently demonstrated that CLIP's visual encoder performs well in this domain, but existing methods fail to make use of the model's text encoder or the temporal information available in video. In this paper we present VCLIP, a method for person identification in videos captured in challenging poses and with changes to a person's clothing. Harnessing the power of pre-trained vision-language models, we jointly train a temporal fusion network while fine-tuning the visual encoder. To leverage the cross-modal embedding space, we use learned biometric pedestrian attribute features to further enhance our model's person re-identification (Re-ID) ability. We demonstrate significant performance improvements via experiments on the MEVID and CCVID datasets, particularly in the more challenging clothes-changing conditions. In support of this and future methods that use textual attributes for Re-ID with multi-modal models, we release a dataset of annotated pedestrian attributes for the popular MEVID dataset.
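The paper itself specifies the temporal fusion design; as a rough illustration of the general idea the abstract describes, the following is a minimal PyTorch sketch of learned temporal attention pooling over per-frame visual embeddings. The class name TemporalAttentionPool, the small scoring MLP, and the feature dimension are hypothetical placeholders, not the authors' implementation.

import torch
import torch.nn as nn

class TemporalAttentionPool(nn.Module):
    """Illustrative sketch: pool T per-frame embeddings into one
    tracklet embedding via a learned attention score per frame,
    so uninformative or occluded frames can be down-weighted."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Hypothetical scoring head; the paper's fusion network may differ.
        self.score = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(),
            nn.Linear(dim // 4, 1),
        )

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, D) per-frame visual features (e.g., from CLIP)
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        pooled = (weights * frame_feats).sum(dim=1)              # (B, D)
        # L2-normalize so tracklets are comparable by cosine similarity,
        # as is common in CLIP-style embedding spaces.
        return nn.functional.normalize(pooled, dim=-1)

if __name__ == "__main__":
    pool = TemporalAttentionPool(dim=512)
    feats = torch.randn(2, 8, 512)   # 2 tracklets, 8 frames each
    print(pool(feats).shape)         # torch.Size([2, 512])

In a cross-modal setup like the one the abstract outlines, the pooled tracklet embedding could then be compared against text-encoder embeddings of pedestrian attributes in the shared space; that pairing step is likewise only suggested here, not taken from the paper.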