Query-Guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch

Tripathi, Aditay; Mishra, Anand; Chakraborty, Anirban

Aditay Tripathi, Anand Mishra, Anirban Chakraborty; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 1083-1092

Abstract

In this study, we explore sketch-based object localization on natural images. Given a crude hand-drawn sketch of an object, the task is to localize all instances of that object in the target image. The problem is difficult because hand-drawn sketches are abstract, vary in style and quality, and are separated from natural images by a large domain gap. Existing solutions address this with attention-based frameworks that merge query information into image features. However, these methods often integrate query features only after the image features have been learned independently, which leads to inadequate alignment and, as a result, incorrect localization. In contrast, we propose a novel sketch-guided vision transformer encoder that applies cross-attention after each block of the transformer-based image encoder to learn query-conditioned image features, leading to stronger alignment with the query sketch. Further, at the decoder's output, object and sketch features are refined to better align the object representations with the sketch query, thereby improving localization. Because the target image features it learns are query-aware, the proposed model also generalizes to object categories not seen during training. Our framework can also use multiple sketch queries via a novel trainable sketch fusion strategy. The model is evaluated on images from the public MS-COCO benchmark using sketch queries from the QuickDraw! and Sketchy datasets. Compared with existing localization methods, the proposed approach gives a 6.6% and 8.0% improvement in mAP for seen objects using sketch queries from the QuickDraw! and Sketchy datasets, respectively, and a 12.2% improvement in AP@50 for large objects that are 'unseen' during training.
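The idea of applying cross-attention after each encoder block can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single-head attention, the residual connection, the token counts, the feature width `d`, and the helper names `cross_attention`, `sketch_guided_encoder`, and `make_block` are all assumptions made for illustration, and the `self_block` function is only a toy stand-in for a real transformer block.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, sketch_tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image tokens attend to sketch tokens."""
    q = image_tokens @ w_q    # queries come from the image, shape (N_img, d)
    k = sketch_tokens @ w_k   # keys come from the sketch, shape (N_sk, d)
    v = sketch_tokens @ w_v   # values come from the sketch, shape (N_sk, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one distribution over sketch tokens per image token
    # Residual connection (an assumption): image features are updated, not replaced.
    return image_tokens + weights @ v, weights


def sketch_guided_encoder(image_tokens, sketch_tokens, blocks):
    """Apply cross-attention with the sketch after every encoder block."""
    x = image_tokens
    for self_block, (w_q, w_k, w_v) in blocks:
        x = self_block(x)  # stand-in for an image-only transformer block
        x, _ = cross_attention(x, sketch_tokens, w_q, w_k, w_v)
    return x


# Hypothetical sizes: 6 image tokens, 3 sketch tokens, feature width 8.
rng = np.random.default_rng(0)
d = 8
image_tokens = rng.normal(size=(6, d))
sketch_tokens = rng.normal(size=(3, d))


def make_block():
    # Toy block: a residual tanh layer plus random projection matrices.
    w = rng.normal(scale=0.1, size=(d, d))
    self_block = lambda x: x + np.tanh(x @ w)
    projections = tuple(rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
    return self_block, projections


blocks = [make_block() for _ in range(2)]
features = sketch_guided_encoder(image_tokens, sketch_tokens, blocks)
print(features.shape)  # (6, 8)
```

Because the sketch enters at every block, the resulting image features depend on the query: running the same image tokens with a different sketch produces different features, which is the sense in which the features are query-conditioned.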

Related Material

@InProceedings{Tripathi_2024_WACV,
    author    = {Tripathi, Aditay and Mishra, Anand and Chakraborty, Anirban},
    title     = {Query-Guided Attention in Vision Transformers for Localizing Objects Using a Single Sketch},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {1083-1092}
}