Transferring Foundation Models for Generalizable Robotic Manipulation

Yang, Jiange; Tan, Wenhui; Jin, Chuhao; Yao, Keling; Liu, Bei; Fu, Jianlong; Song, Ruihua; Wu, Gangshan; Wang, Limin

Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 1999-2010

Abstract

Improving the generalization capabilities of general-purpose robotic manipulation in the real world has long been a significant challenge. Existing approaches often rely on collecting large-scale robotic data, which is costly and time-consuming. However, because such data lacks diversity, these approaches are typically limited in open-domain scenarios with new objects and diverse environments. In this paper, we propose a novel paradigm that effectively leverages language-reasoning segmentation masks generated by internet-scale foundation models to condition robot manipulation tasks. By integrating the mask modality, which incorporates semantic, geometric, and temporal correlation priors derived from vision foundation models, into the end-to-end policy model, our approach can effectively and robustly perceive object pose and enable sample-efficient generalization learning, including to new object instances, semantic categories, and unseen backgrounds. We first introduce a series of foundation models to ground natural language demands across multiple tasks. Second, we develop a two-stream 2D policy model based on imitation learning, which processes raw images and object masks to predict robot actions in a local-global perception manner. Extensive real-world experiments conducted on a Franka Emika robot and a low-cost dual-arm robot demonstrate the effectiveness of our proposed paradigm and policy. Demos can be found in link1 or link2, and our code will be released at https://github.com/MCG-NJU/TPM.
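
As a rough illustration of how an object mask could supply both a global and a local view to a policy, the sketch below pairs the full image with a crop around the mask's bounding box. This is only one possible reading of "local-global perception" and is not taken from the paper; the function names `mask_bbox` and `local_global_views`, the `margin` parameter, and the list-of-lists image representation are all hypothetical choices made for this example.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]      # binary object mask, nonzero = object pixel (assumed format)
Image = List[List[float]]   # single-channel image with the same size as the mask


def mask_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the inclusive (top, left, bottom, right) box of nonzero pixels, or None."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    return rows[0], min(cols), rows[-1], max(cols)


def local_global_views(image: Image, mask: Mask, margin: int = 1):
    """Pair the full image (global view) with a crop around the mask (local view)."""
    box = mask_bbox(mask)
    if box is None:
        # No object was segmented, so only the global view is available.
        return image, None
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    # Expand the box by the margin and clamp it to the image bounds.
    top, left = max(0, top - margin), max(0, left - margin)
    bottom, right = min(height - 1, bottom + margin), min(width - 1, right + margin)
    local = [row[left:right + 1] for row in image[top:bottom + 1]]
    return image, local
```

In a learned policy, the two views would typically be fed to separate encoders; the cropping step above only shows where the mask could enter the pipeline.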

Related Material

[bibtex]

@InProceedings{Yang_2025_WACV,
    author    = {Yang, Jiange and Tan, Wenhui and Jin, Chuhao and Yao, Keling and Liu, Bei and Fu, Jianlong and Song, Ruihua and Wu, Gangshan and Wang, Limin},
    title     = {Transferring Foundation Models for Generalizable Robotic Manipulation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {1999-2010}
}