VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning

Haoran Xu, Peixi Peng, Guang Tan, Yiqian Chang, Luntong Li, Yonghong Tian; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 29534-29544

Abstract


Vision-based Reinforcement Learning (VRL) attempts to establish associations between visual inputs and optimal actions through interactions with the environment. Given the high-dimensional and complex nature of visual data, it becomes essential to learn the policy upon high-quality state representations. To this end, existing VRL methods primarily rely on interaction-collected data, combined with self-supervised auxiliary tasks. However, two key challenges remain: limited data samples and a lack of task-relevant semantic constraints. To tackle these challenges, we propose DGC, a method that distills guidance from Visual Language Models (VLMs) alongside self-supervised learning into a compact VRL agent. Notably, we leverage the state representation capabilities of VLMs, rather than their decision-making abilities. Within DGC, a novel prompting-reasoning pipeline is designed to convert historical observations and actions into usable supervision signals, enabling semantic understanding within the compact visual encoder. By leveraging these distilled semantic representations, the VRL agent achieves significant improvements in sample efficiency. Extensive experiments on the CARLA benchmark demonstrate the state-of-the-art performance of our method.
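
To make the distillation idea concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' DGC implementation) of aligning a compact visual encoder with embeddings produced by a frozen VLM. The encoder architecture, the linear projection head, and the cosine-similarity objective are illustrative assumptions; in the paper, such an alignment term would sit alongside the self-supervised auxiliary task and the RL objective.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CompactEncoder(nn.Module):
    """Small convolutional encoder of the kind a compact VRL agent might use."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, stride=2), nn.ReLU(),
            nn.Flatten(),
        )
        self.fc = nn.LazyLinear(feat_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.fc(self.conv(obs))


def distillation_loss(student_feat: torch.Tensor,
                      vlm_feat: torch.Tensor,
                      proj: nn.Module) -> torch.Tensor:
    """Align projected student features with frozen VLM embeddings
    via a cosine-similarity distillation term (an assumed objective)."""
    pred = proj(student_feat)
    return 1.0 - F.cosine_similarity(pred, vlm_feat.detach(), dim=-1).mean()


if __name__ == "__main__":
    encoder = CompactEncoder(feat_dim=256)
    proj = nn.Linear(256, 512)        # maps student features to the VLM embedding size
    obs = torch.randn(8, 3, 84, 84)   # a batch of image observations
    vlm_feat = torch.randn(8, 512)    # stand-in for cached VLM-derived embeddings
    loss = distillation_loss(encoder(obs), vlm_feat, proj)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")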

Related Material


BibTeX
@InProceedings{Xu_2025_CVPR,
    author    = {Xu, Haoran and Peng, Peixi and Tan, Guang and Chang, Yiqian and Li, Luntong and Tian, Yonghong},
    title     = {VLMs-Guided Representation Distillation for Efficient Vision-Based Reinforcement Learning},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {29534-29544}
}