AffordanceLLM: Grounding Affordance from Vision Language Models

Shengyi Qian, Weifeng Chen, Min Bai, Xiong Zhou, Zhuowen Tu, Li Erran Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7587-7597


Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task, as a successful solution requires a comprehensive understanding of a scene in multiple aspects, including detection, localization, and recognition of objects with their parts; the geo-spatial configuration/layout of the scene; 3D shapes and physics; and the functionality and potential interaction of the objects and humans. Much of this knowledge is hidden and lies beyond what the image content and the supervised labels of a limited training set can provide. In this paper, we make an attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge from pre-trained large-scale vision language models. Under the AGD20K benchmark, our proposed model demonstrates a significant performance gain over the competing methods for in-the-wild object affordance grounding. We further demonstrate it can ground affordance for objects from random Internet images, even if both objects and actions are unseen during training.

Related Material

@InProceedings{Qian_2024_CVPR,
    author    = {Qian, Shengyi and Chen, Weifeng and Bai, Min and Zhou, Xiong and Tu, Zhuowen and Li, Li Erran},
    title     = {AffordanceLLM: Grounding Affordance from Vision Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7587-7597}
}