Tune-An-Ellipse: CLIP Has Potential to Find What You Want

Jinheng Xie, Songhe Deng, Bing Li, Haozhe Liu, Yawen Huang, Yefeng Zheng, Jurgen Schmidhuber, Bernard Ghanem, Linlin Shen, Mike Zheng Shou; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13723-13732

Abstract


Visual prompting of large vision-language models such as CLIP exhibits intriguing zero-shot capabilities. A manually drawn red circle, commonly used for highlighting, can guide CLIP's attention to the surrounding region to identify specific objects within an image. Without precise object proposals, however, it is insufficient for localization. Our novel, simple yet effective approach, i.e., Differentiable Visual Prompting, enables CLIP to localize in a zero-shot manner: given an image and a text prompt describing an object, we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting, and then use three loss functions to tune the ellipse coefficients so that they gradually encapsulate the target region. This yields promising experimental results for referring expression comprehension without precisely specified object proposals. In addition, we systematically present the limitations of visual prompting inherent in CLIP and discuss potential solutions.
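The tuning loop sketched in the abstract (render an ellipse differentiably, then gradient-descend on its coefficients under a loss) can be illustrated as follows. This is a minimal toy sketch, not the authors' implementation: the sigmoid-based soft ellipse rendering, the 64x64 grid, the mean-squared-error objective (standing in for the paper's three CLIP-based losses), and the finite-difference gradients (standing in for autodiff through CLIP) are all assumptions made for illustration.

```python
import numpy as np

def soft_ellipse_mask(params, H=64, W=64, tau=8.0):
    """Render an ellipse as a soft (sigmoid) mask so that a loss on the
    mask is differentiable with respect to the coefficients (cx, cy, a, b)."""
    cx, cy, a, b = params
    ys, xs = np.mgrid[0:H, 0:W]
    d = ((xs - cx) / a) ** 2 + ((ys - cy) / b) ** 2
    # Clip the logit to avoid overflow far from the ellipse boundary.
    return 1.0 / (1.0 + np.exp(np.clip(tau * (d - 1.0), -50.0, 50.0)))

def hard_ellipse_mask(params, H=64, W=64):
    """Binary ellipse mask, used here only to build a toy target region."""
    cx, cy, a, b = params
    ys, xs = np.mgrid[0:H, 0:W]
    return ((((xs - cx) / a) ** 2 + ((ys - cy) / b) ** 2) <= 1.0).astype(float)

def loss(params, target):
    # Stand-in objective: MSE between the rendered soft ellipse and a
    # target region mask. In the paper this slot is held by three
    # CLIP-based losses; MSE is an assumption for this sketch.
    return np.mean((soft_ellipse_mask(params) - target) ** 2)

def tune_ellipse(target, init, lr=50.0, steps=300, eps=1e-3):
    """Gradient descent on the four ellipse coefficients, using central
    finite differences in place of automatic differentiation."""
    params = np.asarray(init, dtype=float)
    for _ in range(steps):
        grad = np.zeros_like(params)
        for i in range(len(params)):
            hi, lo = params.copy(), params.copy()
            hi[i] += eps
            lo[i] -= eps
            grad[i] = (loss(hi, target) - loss(lo, target)) / (2.0 * eps)
        params -= lr * grad
    return params

# Toy example: shift and shrink an initial anchor ellipse onto a target region.
target = hard_ellipse_mask((40.0, 30.0, 12.0, 8.0))
init = (32.0, 32.0, 20.0, 20.0)
tuned = tune_ellipse(target, init)
print("loss before:", loss(np.asarray(init, dtype=float), target))
print("loss after: ", loss(tuned, target))
```

In the actual method, the objective would score a CLIP image-text similarity between the prompt and the ellipse-highlighted image, so the gradient flows from the text prompt into the ellipse coefficients; the soft rendering above is what makes that end-to-end tuning possible.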

Related Material


[bibtex]
@InProceedings{Xie_2024_CVPR,
  author    = {Xie, Jinheng and Deng, Songhe and Li, Bing and Liu, Haozhe and Huang, Yawen and Zheng, Yefeng and Schmidhuber, Jurgen and Ghanem, Bernard and Shen, Linlin and Shou, Mike Zheng},
  title     = {Tune-An-Ellipse: CLIP Has Potential to Find What You Want},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {13723-13732}
}