Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID

Wentan Tan, Changxing Ding, Jiayu Jiang, Fei Wang, Yibing Zhan, Dapeng Tao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17127-17137

Abstract


Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming, restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result, we study the transferable text-to-image ReID problem, where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover, we identify and address two key challenges in utilizing the obtained textual descriptions. First, an MLLM tends to generate descriptions with similar structures, causing the model to overfit specific sentence patterns. Thus, we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore, we can build a large-scale dataset with diverse textual descriptions. Second, an MLLM may produce incorrect descriptions. Hence, we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. Then, we mask these words with a larger probability in the subsequent training epoch, alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights, we also achieve state-of-the-art performance in the traditional evaluation settings.
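
The noise-aware masking idea can be illustrated with a small sketch. The abstract does not specify the scoring rule, so everything below is an assumption made for illustration, not the authors' implementation: the function name `word_mask_probabilities`, the use of cosine similarity with a maximum over patches, the per-sentence rescaling, and the `base_prob` and `max_prob` values are all hypothetical. The sketch only shows the general principle that words which match no image patch well receive a larger masking probability in the next epoch.

```python
import numpy as np


def word_mask_probabilities(word_emb, patch_emb, base_prob=0.15, max_prob=0.6):
    """Hypothetical mapping from word-to-patch similarity to masking probability.

    word_emb:  (num_words, dim) array of word token embeddings.
    patch_emb: (num_patches, dim) array of image patch token embeddings.
    Returns a (num_words,) array of probabilities in [base_prob, max_prob].
    """
    # L2-normalise both sets so the dot product is a cosine similarity.
    w = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    p = patch_emb / np.linalg.norm(patch_emb, axis=1, keepdims=True)

    # Similarity between each word and every patch: (num_words, num_patches).
    sim = w @ p.T

    # Assumption: a word is supported by the image if at least one patch
    # matches it well, so each word is scored by its best-matching patch.
    score = sim.max(axis=1)

    # Assumption: rescale within the sentence so the worst-matching word
    # gets noise 1 and the best-matching word gets noise 0.
    span = score.max() - score.min()
    if span < 1e-12:
        noise = np.zeros_like(score)
    else:
        noise = (score.max() - score) / span

    # Words with higher estimated noise are masked with larger probability.
    return base_prob + (max_prob - base_prob) * noise


# Toy example: two patches along the first two axes of a 3-D space.
patches = np.array([[1.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0]])
words = np.array([[1.0, 0.0, 0.0],   # aligned with a patch
                  [0.0, 0.0, 1.0],   # orthogonal to every patch
                  [1.0, 1.0, 0.0]])  # partially aligned
probs = word_mask_probabilities(words, patches)
```

In this toy case the word that is orthogonal to every patch receives the largest masking probability and the perfectly aligned word the smallest. In practice the embeddings would come from the text and image encoders of the ReID model, and the probabilities would be recomputed each epoch.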

Related Material


[bibtex]
@InProceedings{Tan_2024_CVPR,
    author    = {Tan, Wentan and Ding, Changxing and Jiang, Jiayu and Wang, Fei and Zhan, Yibing and Tao, Dapeng},
    title     = {Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {17127-17137}
}