DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution

Zhao, Yuzhong; Liu, Feng; Liu, Yue; Liao, Mingxiang; Gong, Chen; Ye, Qixiang; Wan, Fang

Yuzhong Zhao, Feng Liu, Yue Liu, Mingxiang Liao, Chen Gong, Qixiang Ye, Fang Wan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 24742-24752

Abstract

One important task of multimodal models is to translate referred image regions into human-preferred language descriptions. Existing methods, however, ignore the resolution adaptability needs of different tasks, which hinders them from finding precise language descriptions. In this study, we propose DynRefer, an approach that pursues high-accuracy region-level referring by mimicking the resolution adaptability of human visual cognition. During training, DynRefer stochastically aligns language descriptions of multimodal tasks with images of multiple resolutions, which are constructed by nesting a set of random views around the referred region. This process essentially constructs a set of region representations, from which suitable representations for specific tasks can be matched. During inference, DynRefer selectively performs multimodal referring by sampling proper region representations for tasks from the set of views based on image and task priors. This allows the visual information for referring to better match human preferences, thereby improving the representational adaptability of region-level multimodal models. Experiments show that DynRefer brings mutual improvement across a broad range of tasks, including region-level captioning, open-vocabulary region recognition, and attribute detection. Furthermore, DynRefer achieves state-of-the-art results on multiple region-level multimodal tasks using a single model. Code is available at https://github.com/callsys/DynRefer.
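The idea of nesting random views around a referred region can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked in the abstract for that); the function name `nested_views`, the `num_views` parameter, and the linear interpolation rule between the region box and the full image are all assumptions made only for illustration.

```python
import random


def nested_views(box, image_size, num_views=3, rng=None):
    """Return crop boxes nested between a referred region and the full image.

    box: (x0, y0, x1, y1) of the referred region, in pixels.
    image_size: (width, height) of the image.

    Assumed sampling rule (hypothetical, not taken from the paper): each view
    linearly interpolates between the region box (t = 0) and the full image
    (t = 1) using a random factor t.
    """
    rng = rng or random.Random()
    x0, y0, x1, y1 = box
    width, height = image_size
    # Sorting the factors guarantees that each view contains the previous one.
    factors = sorted(rng.random() for _ in range(num_views))
    views = []
    for t in factors:
        views.append((
            x0 * (1 - t),            # left edge moves toward 0
            y0 * (1 - t),            # top edge moves toward 0
            x1 + (width - x1) * t,   # right edge moves toward the image width
            y1 + (height - y1) * t,  # bottom edge moves toward the image height
        ))
    return views


views = nested_views((40, 30, 80, 90), (200, 100), num_views=4,
                     rng=random.Random(0))
```

Under these assumptions, every returned view contains the referred region and the views before it, so the list runs from a tight crop with little context to a wide crop with more context.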

Related Material

[bibtex]

@InProceedings{Zhao_2025_CVPR,
    author    = {Zhao, Yuzhong and Liu, Feng and Liu, Yue and Liao, Mingxiang and Gong, Chen and Ye, Qixiang and Wan, Fang},
    title     = {DynRefer: Delving into Region-level Multimodal Tasks via Dynamic Resolution},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {24742-24752}
}