Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs

Shiang-Feng Tsai, Yuan-Hong Liao, Jin-Cheng Jhang, Nan Qiao, Min Sun; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 33922-33932

Abstract


Part-level pointing is important for fine-grained interaction and reasoning, yet existing Multimodal Large Language Models (MLLMs) remain limited to instance-level pointing. Part-level pointing presents unique challenges: annotation is costly, parts follow a long-tailed distribution, and many parts are difficult to specify precisely in language. We introduce POinting at Parts (POP), a training-free, plug-and-play approach that addresses these challenges in a few-shot setting. POP fuses textual and visual attention maps with self-supervised visual correspondences from the query image and the few-shot examples. On average across the three evaluated datasets, POP achieves accuracy gains of up to 8.9 points in the one-shot setting and 16.4 points in the three-shot setting for the pointing-capable MLLMs Qwen2.5-VL, Ovis2.5, and Molmo. Notably, even MLLMs without pointing capability benefit significantly from the proposed approach. These results establish a simple yet effective path toward fine-grained spatial grounding in MLLMs.
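The abstract does not spell out how the attention maps and the visual correspondences are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes a single attention map over image patches, self-supervised patch features for the query image, and one feature vector per few-shot example taken at the annotated part. The function names, the weighted-sum fusion, and the `alpha` parameter are all hypothetical.

```python
import numpy as np


def normalize(m):
    """Min-max scale a 2-D map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=np.float64)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)


def correspondence_map(query_feats, support_feat):
    """Cosine similarity between every query patch and one support part feature.

    query_feats: (H, W, D) patch features of the query image.
    support_feat: (D,) feature at the annotated part in a few-shot example.
    """
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    s = support_feat / (np.linalg.norm(support_feat) + 1e-8)
    return q @ s  # shape (H, W)


def fuse_and_point(attn_map, query_feats, support_feats, alpha=0.5):
    """Hypothetical fusion: weighted sum of attention and mean correspondence.

    Returns the (row, col) patch with the highest fused score and the fused map.
    The weighted sum and `alpha` are assumptions made for illustration.
    """
    corr = np.mean(
        [correspondence_map(query_feats, s) for s in support_feats], axis=0
    )
    fused = alpha * normalize(attn_map) + (1.0 - alpha) * normalize(corr)
    row, col = np.unravel_index(np.argmax(fused), fused.shape)
    return (int(row), int(col)), fused
```

In this sketch, adding few-shot examples only changes the averaged correspondence map, so no model weights are updated, which is consistent with the training-free framing. How POP actually weights or combines the signals is not stated in the abstract.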

Related Material


@InProceedings{Tsai_2026_CVPR,
    author    = {Tsai, Shiang-Feng and Liao, Yuan-Hong and Jhang, Jin-Cheng and Qiao, Nan and Sun, Min},
    title     = {Pointing at Parts: Training-Free Few-Shot Grounding in Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {33922-33932}
}