@InProceedings{Kukal_2025_WACV,
    author    = {Kukal, Rupanjali and Patravali, Jay and Yu, Fuxun and Singh, Simranjit and Karianakis, Nikolaos and Madhok, Rishi},
    title     = {Click\&Describe: Multimodal Grounding and Tracking for Aerial Objects},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6011-6021}
}
Click&Describe: Multimodal Grounding and Tracking for Aerial Objects
Abstract
The fusion of multiple modalities, such as vision and language, has led to significant progress in grounding and tracking tasks. However, this success has not yet translated to aerial single-object tracking (SOT) due to the lack of text annotations in existing aerial SOT datasets. To overcome this limitation, we provide text annotations for five existing aerial datasets, designed to support and promote multimodal research in the aerial tracking domain. Furthermore, to address challenges such as small object dimensions, similar-looking objects, and target size fluctuations, we introduce a third input modality: click (or point prompt). To offer a user-friendly and interactive alternative to precise bounding box annotations, we seamlessly integrate click and language information in the model's input. This enables approximate target specification with reduced effort and time. We introduce CLaVi, a novel multimodal framework that redefines input interaction by incorporating multiple modalities. This integration improves target localization and tracking efficiency, providing a significant advancement in the way input is provided to the model. Furthermore, we conduct experiments on the five datasets to establish the AerTrack-460 benchmark and validate the effectiveness of our approach. On the AerTrack-460 benchmark, our approach shows competitive performance and in some cases outperforms previous language-based grounding and tracking techniques, setting a strong baseline for future research. Code and data will be made available soon.