Rethinking Intermediate Representation for VLM-based Robot Manipulation

Weiliang Tang, Jialin Gao, Jia-Hui Pan, Gang Wang, Li Erran Li, Yun-Hui Liu, Mingyu Ding, Pheng-Ann Heng, Chi-Wing Fu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 29652-29662

Abstract


Vision-Language Models (VLMs) are now an important component for enabling robust robot manipulation. Yet using them to translate human instructions into an action-resolvable intermediate representation often requires a trade-off between VLM-comprehensibility and generalizability. Inspired by the structure of context-free grammars, we design SEAM, a Semantic Assembly representation that decomposes the intermediate representation into a vocabulary and a grammar. This decomposition yields a concise vocabulary of semantically rich operations and a VLM-friendly grammar for handling diverse unseen tasks. We also design a novel open-vocabulary segmentation paradigm with an in-context learning strategy that precisely localizes fine-grained object parts for manipulation (e.g., a cup handle or a teapot opening), with the shortest inference time among all state-of-the-art concurrent works. We then formulate new metrics for action-generalizability and VLM-comprehensibility to evaluate mainstream representations, showing that SEAM performs strongly on both. Extensive real-world experiments further demonstrate the state-of-the-art performance of SEAM under varying settings and tasks.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Tang_2026_CVPR,
    author    = {Tang, Weiliang and Gao, Jialin and Pan, Jia-Hui and Wang, Gang and Li, Li Erran and Liu, Yun-Hui and Ding, Mingyu and Heng, Pheng-Ann and Fu, Chi-Wing},
    title     = {Rethinking Intermediate Representation for VLM-based Robot Manipulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {29652-29662}
}