Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding

Wenbo Chen, Zhen Xu, Ruotao Xu, Si Wu, Hau-San Wong; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 3931-3941

Abstract


The goal of visual grounding is to establish connections between target objects and textual descriptions. Large Language Models (LLMs) have demonstrated strong comprehension abilities across a variety of visual tasks. To establish precise associations between the text and the corresponding visual region, we propose a Task-aware Cross-modal feature Refinement Transformer with LLMs for visual grounding, referred to as TCRT. To enable an LLM trained solely on text to understand images, we introduce an LLM adaptation module that extracts text-related visual features to bridge the domain discrepancy between the textual and visual modalities. We feed the text and visual features into the LLM to obtain task-aware priors. To enable these priors to guide the feature fusion process, we further incorporate a cross-modal feature fusion module, which allows the task-aware embeddings to refine visual features and facilitates information interaction between the Referring Expression Comprehension (REC) and Referring Expression Segmentation (RES) tasks. We have performed extensive experiments to verify the effectiveness of the main components and demonstrate the superior performance of the proposed TCRT over state-of-the-art end-to-end visual grounding methods on RefCOCO, RefCOCOg, RefCOCO+ and ReferItGame.
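
As a rough illustration of the pipeline described in the abstract, the following PyTorch-style sketch wires together an LLM adaptation module, a frozen LLM producing task-aware priors, and a cross-modal fusion module feeding REC (box) and RES (mask) heads. All module names, dimensions, attention layouts, and the LLM interface below are illustrative assumptions, not the paper's actual implementation.

    import torch
    import torch.nn as nn

    class LLMAdaptationModule(nn.Module):
        """Projects visual tokens into the LLM embedding space and lets text queries
        attend to them, keeping text-related visual content (assumed design)."""
        def __init__(self, vis_dim, llm_dim, num_heads=8):
            super().__init__()
            self.proj = nn.Linear(vis_dim, llm_dim)
            self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

        def forward(self, vis_tokens, text_embeds):
            v = self.proj(vis_tokens)                    # (B, Nv, llm_dim)
            out, _ = self.cross_attn(text_embeds, v, v)  # (B, Nt, llm_dim)
            return out

    class CrossModalFusion(nn.Module):
        """Refines visual features with task-aware prior embeddings from the LLM;
        the refined features are shared by the REC and RES branches (assumed design)."""
        def __init__(self, vis_dim, llm_dim, num_heads=8):
            super().__init__()
            self.prior_proj = nn.Linear(llm_dim, vis_dim)
            self.attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)

        def forward(self, vis_tokens, task_priors):
            p = self.prior_proj(task_priors)             # (B, Np, vis_dim)
            refined, _ = self.attn(vis_tokens, p, p)     # priors guide refinement
            return vis_tokens + refined

    class TCRTSketch(nn.Module):
        def __init__(self, vis_dim=256, llm_dim=4096):
            super().__init__()
            self.adapt = LLMAdaptationModule(vis_dim, llm_dim)
            self.fuse = CrossModalFusion(vis_dim, llm_dim)
            self.box_head = nn.Linear(vis_dim, 4)        # REC: (cx, cy, w, h)
            self.mask_head = nn.Conv2d(vis_dim, 1, 1)    # RES: per-pixel logits

        def forward(self, vis_tokens, text_embeds, llm):
            adapted = self.adapt(vis_tokens, text_embeds)
            # The frozen LLM consumes text plus adapted visual tokens and returns
            # task-aware prior embeddings of shape (B, Np, llm_dim); this interface
            # is an assumption made for the sketch.
            task_priors = llm(torch.cat([text_embeds, adapted], dim=1))
            fused = self.fuse(vis_tokens, task_priors)
            box = self.box_head(fused.mean(dim=1))       # pooled tokens -> box
            B, N, C = fused.shape
            h = w = int(N ** 0.5)                        # assumes a square token grid
            mask = self.mask_head(fused.transpose(1, 2).reshape(B, C, h, w))
            return box, mask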

Related Material


[pdf]
[bibtex]
@InProceedings{Chen_2025_CVPR,
    author    = {Chen, Wenbo and Xu, Zhen and Xu, Ruotao and Wu, Si and Wong, Hau-San},
    title     = {Task-aware Cross-modal Feature Refinement Transformer with Large Language Models for Visual Grounding},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3931-3941}
}