Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding

Yang Liu, Jiahua Zhang, Qingchao Chen, Yuxin Peng; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2828-2838

Abstract


Visual grounding aims to localize the target object in an image that is most related to a given free-form natural language query. As labeling the position of the target object is labor-intensive, weakly supervised methods, which require only image-sentence annotations during model training, have recently received increasing attention. Most existing weakly supervised methods first generate region proposals via pre-trained object detectors and then employ either a cross-modal similarity score or a reconstruction loss as the criterion for selecting a proposal from among them. However, due to the cross-modal heterogeneous gap, these methods often suffer from high-confidence spurious associations and are prone to error propagation. In this paper, we propose Confidence-aware Pseudo-label Learning (CPL) to overcome the above limitations. Specifically, we first adopt both uni-modal and cross-modal pre-trained models and propose conditional prompt engineering to automatically generate multiple "descriptive, realistic and diverse" pseudo language queries for each region proposal, and then establish reliable cross-modal associations for model training based on the uni-modal similarity score (between pseudo and real text queries). Second, we propose a confidence-aware pseudo-label verification module, which reduces the amount of noise encountered in the training process and the risk of error propagation. Experiments on five widely used datasets validate the efficacy of our proposed components and demonstrate state-of-the-art performance.
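
The association step described in the abstract (scoring each region proposal by the uni-modal similarity between its pseudo queries and the real query) can be illustrated with a toy sketch. This is not the authors' implementation: the bag-of-words cosine similarity is an assumed stand-in for a pre-trained text encoder, the `min_confidence` threshold is a hypothetical simplification of the verification module, and all function names and example queries are invented for illustration.

```python
import math
from collections import Counter


def bow_vector(text):
    """Lowercased bag-of-words counts; a toy stand-in for a pre-trained text encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_pseudo_label(real_query, pseudo_queries, min_confidence=0.5):
    """Pick the proposal whose pseudo queries best match the real query.

    pseudo_queries maps a proposal id to the list of pseudo queries
    generated for that region. Returns (proposal_id, score); proposal_id
    is None when the best score is below min_confidence (a hypothetical
    threshold, assumed here as a simple form of verification).
    """
    q = bow_vector(real_query)
    # Score each proposal by its best-matching pseudo query (text-to-text only).
    scores = {
        pid: max(cosine(q, bow_vector(p)) for p in queries)
        for pid, queries in pseudo_queries.items()
    }
    best = max(scores, key=scores.get)
    score = scores[best]
    return (best if score >= min_confidence else None), score


# Hypothetical pseudo queries for two region proposals.
example_pseudo_queries = {
    "box_0": ["a brown dog on the grass", "dog lying on the left"],
    "box_1": ["a red frisbee in the air", "frisbee above the dog"],
}
label, score = select_pseudo_label("the brown dog on the grass", example_pseudo_queries)
```

In this sketch, a proposal that is selected becomes the pseudo label for the image-sentence pair, while a query that matches no proposal confidently yields `None` and would be skipped during training.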

Related Material


[bibtex]
@InProceedings{Liu_2023_ICCV,
    author    = {Liu, Yang and Zhang, Jiahua and Chen, Qingchao and Peng, Yuxin},
    title     = {Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2828-2838}
}