DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM

Xuchen Li, Xiaokun Feng, Shiyu Hu, Meiqi Wu, Dailing Zhang, Jing Zhang, Kaiqi Huang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7283-7292

Abstract


Visual Language Tracking (VLT) enhances single object tracking (SOT) by integrating natural language descriptions of a video, enabling precise tracking of a specified object. By leveraging high-level semantic information, VLT guides object tracking and alleviates the constraints of relying on the visual modality alone. Nevertheless, most VLT benchmarks are annotated at a single granularity and lack a coherent semantic framework to provide scientific guidance. Moreover, coordinating human annotators to produce high-quality annotations is laborious and time-consuming. To address these challenges, we introduce DTLLM-VLT, which automatically generates extensive, multi-granularity text to enhance environmental diversity. (1) DTLLM-VLT generates scientific, multi-granularity text descriptions using a cohesive prompt framework. Its succinct and highly adaptable design allows seamless integration into various visual tracking benchmarks. (2) We deploy our approach on three prominent benchmarks covering short-term tracking, long-term tracking, and global instance tracking. We offer four granularity combinations for these benchmarks, considering the extent and density of semantic information, thereby showcasing the practicality and versatility of DTLLM-VLT. (3) We conduct comparative experiments on VLT benchmarks with different text granularities, evaluating and analyzing the impact of diverse text on tracking performance. In conclusion, this work leverages LLMs to provide multi-granularity semantic information for the VLT task from efficient and diverse perspectives, enabling fine-grained evaluation of multi-modal trackers. In the future, we believe this work can be extended to more datasets to support the understanding of vision datasets.
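
To illustrate how a prompt-based, multi-granularity description generator might be structured, the following is a minimal Python sketch. It is not the authors' released implementation: the `llm_generate` callable, the `GRANULARITIES` definitions, and the bounding-box input format are all assumptions standing in for whatever LLM backend and granularity settings the paper actually uses.

# Hypothetical sketch: generate object descriptions at different granularities via an LLM prompt.
# `llm_generate` is a placeholder for any text-generation backend (local or hosted LLM).

from typing import Callable, Dict, Tuple

# Granularity is characterized here by the density (initial frame vs. per-frame) and the
# extent (concise vs. detailed) of semantic information; these four combinations are
# illustrative stand-ins for the paper's four granularity settings.
GRANULARITIES: Dict[str, str] = {
    "initial_concise": "one short phrase naming the target object in the first frame",
    "initial_detailed": "a detailed sentence describing the target's class, appearance, and position in the first frame",
    "dense_concise": "one short phrase per annotated frame describing the target",
    "dense_detailed": "a detailed sentence per annotated frame describing the target's appearance and motion",
}

def build_prompt(object_class: str, bbox: Tuple[int, int, int, int], granularity: str) -> str:
    """Compose a single prompt asking the LLM for a description at the requested granularity."""
    instruction = GRANULARITIES[granularity]
    return (
        f"You are annotating a visual tracking video. The target is a '{object_class}' "
        f"with bounding box {bbox} given as (x, y, w, h). "
        f"Write {instruction}. Return only the description text."
    )

def describe_target(llm_generate: Callable[[str], str],
                    object_class: str,
                    bbox: Tuple[int, int, int, int],
                    granularity: str) -> str:
    """Generate one text description for the given target and granularity."""
    prompt = build_prompt(object_class, bbox, granularity)
    return llm_generate(prompt).strip()

if __name__ == "__main__":
    # Dummy backend so the sketch runs without any external service.
    fake_llm = lambda prompt: "a red car moving along the highway"
    print(describe_target(fake_llm, "car", (120, 80, 60, 40), "initial_detailed"))

In practice, such a function would be called once per target (for the initial-frame granularities) or once per annotated frame (for the dense granularities), and the resulting texts attached to the existing tracking benchmark annotations.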

Related Material


[pdf]
[bibtex]
@InProceedings{Li_2024_CVPR,
    author    = {Li, Xuchen and Feng, Xiaokun and Hu, Shiyu and Wu, Meiqi and Zhang, Dailing and Zhang, Jing and Huang, Kaiqi},
    title     = {DTLLM-VLT: Diverse Text Generation for Visual Language Tracking Based on LLM},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7283-7292}
}