DUN: Dual-Path Temporal Matching Network for Natural Language-Based Vehicle Retrieval
Retrieving vehicles that match natural language descriptions from collections of videos is a novel and uniquely challenging task. It requires considering not only vehicle types and colors but also temporal relations, e.g., "A white crossover keeping straight behind a silver hatchback." To perform this task, we propose the Dual-path Temporal Matching Network (DUN). DUN uses a pre-trained CNN and GloVe to extract visual and text features, respectively, and GRUs to mine temporal relationships in videos and sentences. Its performance can be improved further with techniques such as re-ranking. Despite its simple structure, DUN achieved second place in Track 5 of the AI City Challenge 2021.
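The dual-path idea can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes that each path runs a single-layer GRU over a sequence of pre-extracted feature vectors (stand-ins for CNN frame features and GloVe word vectors), that the final hidden state serves as the embedding, and that matching uses cosine similarity; the abstract specifies none of these details. The names `TinyGRU`, `cosine`, and `rank_tracks` are hypothetical, the weights are random and untrained, and biases are omitted for brevity.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Single-layer GRU with random, untrained weights (illustration only)."""

    def __init__(self, in_dim, hid_dim, rng):
        def mat(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hid_dim = hid_dim
        # One input and one recurrent matrix per gate:
        # update (z), reset (r), candidate (n). Biases are omitted.
        self.W = {g: mat(hid_dim, in_dim) for g in "zrn"}
        self.U = {g: mat(hid_dim, hid_dim) for g in "zrn"}

    def step(self, x, h):
        z = [sigmoid(a + b) for a, b in zip(matvec(self.W["z"], x), matvec(self.U["z"], h))]
        r = [sigmoid(a + b) for a, b in zip(matvec(self.W["r"], x), matvec(self.U["r"], h))]
        rh = [ri * hi for ri, hi in zip(r, h)]
        n = [math.tanh(a + b) for a, b in zip(matvec(self.W["n"], x), matvec(self.U["n"], rh))]
        # Blend the candidate state with the previous state via the update gate.
        return [(1.0 - zi) * ni + zi * hi for zi, ni, hi in zip(z, n, h)]

    def encode(self, seq):
        """Run the GRU over a sequence and return the final hidden state."""
        h = [0.0] * self.hid_dim
        for x in seq:
            h = self.step(x, h)
        return h


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_tracks(query_vec, track_vecs):
    """Return track indices sorted from most to least similar to the query."""
    scores = [cosine(query_vec, t) for t in track_vecs]
    return sorted(range(len(track_vecs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    rng = random.Random(0)
    visual_path = TinyGRU(in_dim=4, hid_dim=5, rng=rng)  # stand-in for CNN features
    text_path = TinyGRU(in_dim=3, hid_dim=5, rng=rng)    # stand-in for GloVe vectors

    # Two hypothetical vehicle tracks, each a sequence of per-frame features.
    tracks = [
        [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]],
        [[0.9, 0.1, 0.0, 0.2], [0.8, 0.2, 0.1, 0.1], [0.7, 0.3, 0.2, 0.0]],
    ]
    # One hypothetical query, a sequence of word vectors.
    query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

    track_embs = [visual_path.encode(t) for t in tracks]
    print(rank_tracks(text_path.encode(query), track_embs))
```

Because the weights are untrained, the ranking printed here carries no meaning; the sketch only shows how two separate recurrent encoders can map a video track and a sentence into a shared space where they can be compared. Since the GRU consumes its input in order, reversing a sequence generally changes the embedding, which is the property that lets such a model distinguish temporal relations.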