Comprehensive Visual Features and Pseudo Labeling for Robust Natural Language-Based Vehicle Retrieval
Vehicle retrieval has become crucial for public transportation and intelligent transportation systems owing to the rapid growth of large-scale traffic video data. Vehicle re-identification and vehicle tracking are the main components of most vision-based vehicle retrieval systems. However, the limited information available in traffic video feeds constrains the effectiveness of vision-only retrieval algorithms. This article therefore proposes a contrastive cross-modal vehicle retrieval approach that exploits the complementarity of natural language and visual representations. It also proposes an efficient method for fusing multiple input image features to extract comprehensive information about each vehicle, together with pseudo labeling and efficient post-processing techniques that enhance retrieval accuracy. The proposed method ranked 3rd on the test set of the 2023 AI City Challenge Track 2: Tracked-Vehicle Retrieval by Natural Language Descriptions, with a Mean Reciprocal Rank (MRR) score of 0.4795. Source code for the proposed approach is openly available at https://github.com/anminhhung/AI-City-2023-Track2.
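For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Mean Reciprocal Rank can be computed, assuming each natural language query has exactly one correct vehicle track. The function name `mean_reciprocal_rank`, the dictionary layout, and the query and track identifiers are illustrative assumptions; they are not taken from the challenge's evaluation code or from the linked repository.

```python
from typing import Dict, Sequence


def mean_reciprocal_rank(
    rankings: Dict[str, Sequence[str]],
    ground_truth: Dict[str, str],
) -> float:
    """Average of 1/rank of the correct track over all queries.

    rankings:     query id -> track ids ordered from best to worst match
                  (hypothetical layout, assumed for illustration).
    ground_truth: query id -> the single correct track id.
    A query whose correct track is missing from its ranking contributes 0.
    """
    if not ground_truth:
        raise ValueError("ground_truth must contain at least one query")

    total = 0.0
    for query_id, target in ground_truth.items():
        ranked = rankings.get(query_id, ())
        # Ranks are 1-based: the top result has rank 1.
        for position, track_id in enumerate(ranked, start=1):
            if track_id == target:
                total += 1.0 / position
                break
    return total / len(ground_truth)


# Toy data (made up): correct tracks appear at ranks 1, 2 and 4.
example_rankings = {
    "q1": ["t1", "t7", "t3", "t9"],
    "q2": ["t5", "t2", "t8", "t4"],
    "q3": ["t6", "t1", "t2", "t3"],
}
example_truth = {"q1": "t1", "q2": "t2", "q3": "t3"}
example_mrr = mean_reciprocal_rank(example_rankings, example_truth)
```

In this toy case the reciprocal ranks are 1, 1/2 and 1/4, so the mean is 1.75 / 3. Under this definition, a score of 0.4795 means the average reciprocal rank of the correct track is slightly below one half.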