Cross-Modal Target Retrieval for Tracking by Natural Language

Yihao Li, Jun Yu, Zhongpeng Cai, Yuwen Pan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2022, pp. 4931-4940


Tracking by natural language specification in video is a challenging task in computer vision. Unlike initializing the target state with only a bounding box in the first frame, a language specification has strong potential to help visual object trackers capture appearance variation and eliminate semantic ambiguity of the tracked object. In this paper, we carefully design a unified local-global-search framework from the perspective of cross-modal retrieval, comprising a local tracker, an adaptive retrieval switch module, and a target-specific retrieval module. The adaptive retrieval switch module aligns semantics from the visual signal and the lingual description of the target using three sub-modules, i.e., object-aware attention memory, part-aware cross-attention, and vision-language contrast, which together enable an automatic switch between local search and global search. When the global search mechanism is triggered, the target-specific retrieval module re-localizes the missing target over the whole image via an efficient vision-language-guided proposal selector and target-text matching. Extensive experiments on three prevailing benchmarks show the effectiveness and generalization of our framework.
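The switching behavior described above can be sketched in a few lines. The following is an illustrative sketch only, not the authors' implementation: the function names, the scalar vision-language similarity input, and the threshold value are all assumptions made for exposition.

```python
# Illustrative sketch (not the paper's code) of the adaptive retrieval
# switch: when vision-language agreement on the local tracker's output
# drops below a threshold, fall back to image-wide global retrieval.
# All names and the threshold are assumptions for illustration.

def adaptive_switch(vl_similarity: float, threshold: float = 0.5) -> str:
    """Return 'local' to keep local search, 'global' to trigger retrieval."""
    return "local" if vl_similarity >= threshold else "global"

def track_sequence(vl_scores):
    """Simulate per-frame search-mode decisions from a stream of
    vision-language agreement scores (e.g., cosine similarities)."""
    return [adaptive_switch(s) for s in vl_scores]

if __name__ == "__main__":
    # High agreement keeps the local tracker; a sudden drop (e.g., the
    # target leaves the search region) triggers global search.
    print(track_sequence([0.9, 0.8, 0.2, 0.7]))
    # → ['local', 'local', 'global', 'local']
```

In the full framework the scalar score would come from the three alignment sub-modules (object-aware attention memory, part-aware cross-attention, vision-language contrast) rather than a single similarity.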

Related Material

@InProceedings{Li_2022_CVPR,
    author    = {Li, Yihao and Yu, Jun and Cai, Zhongpeng and Pan, Yuwen},
    title     = {Cross-Modal Target Retrieval for Tracking by Natural Language},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2022},
    pages     = {4931-4940}
}