Natural Language-Based Vehicle Retrieval With Explicit Cross-Modal Representation Learning
Owing to the explosive growth of large-scale transportation videos, vehicle retrieval has recently come to play an important role in public transportation security and intelligent transport systems. Most vehicle retrieval algorithms are vision-based and consist of vehicle re-identification and vehicle tracking. However, the performance of vision-based vehicle retrieval algorithms is constrained by the limited information provided by traffic video streams. In this paper, we propose a contrastive cross-modal vehicle retrieval solution that maximizes the complementarity between natural language representations and visual representations. The framework of the proposed solution includes: (1) Preprocess a source video in four ways to generate local motional semantics and global motional semantics; (2) Correspondingly, preprocess relevant description sentences in two ways, namely Textual Local Instance Semantics Extraction (TLISE) and Textual Local Motional Semantics Extraction (TLMSE); (3) Use a two-stream architecture model with four visual encoders and four text encoders to extract visual features and textual embeddings; (4) Fuse the visual features and textual embeddings respectively by concatenating them along the feature channel in order of importance, and use the fused representations for retrieval. Using the proposed solution, we achieved an MRR score of 33.20%, ranking 7th in the AI City Challenge 2022 Track 2. The code is publicly available at https://github.com/Katherinaxxx/2022AICITY_T2.
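The fusion step (4) can be sketched minimally as channel-wise concatenation of per-encoder embeddings, ordered by importance. This is an illustrative sketch only: the encoder names, embedding size, and ordering below are assumptions, not the actual implementation from the repository.

```python
# Hedged sketch of channel-wise feature fusion (step 4).
# Assumes each of four encoders emits a fixed-size embedding;
# all names and dimensions here are hypothetical.
import numpy as np

def fuse(features, order):
    """Concatenate per-encoder features along the channel axis,
    with the assumed most-important encoder's features first."""
    return np.concatenate([features[name] for name in order], axis=-1)

# Four hypothetical visual encoders, each producing a 256-d embedding.
rng = np.random.default_rng(0)
visual = {name: rng.standard_normal(256).astype(np.float32)
          for name in ("instance", "local_motion", "global_motion", "context")}

fused = fuse(visual, order=("instance", "local_motion", "global_motion", "context"))
print(fused.shape)  # (1024,)
```

The same fusion would be applied to the textual embeddings, yielding two fused vectors whose similarity (e.g. cosine) drives retrieval ranking.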