Revisiting Counterfactual Problems in Referring Expression Comprehension

Zhihan Yu, Ruifan Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13438-13448

Abstract


Traditional referring expression comprehension (REC) aims to locate the target referent in an image guided by a text query. Several previous methods have studied the counterfactual problem in REC (C-REC), in which the object described by a given query cannot be found in the image. However, these methods focus only on overall image-text mismatch or on a specific attribute mismatch. In this paper, we address the C-REC problem from the deeper perspective of fine-grained attributes. To this end, we first propose a fine-grained counterfactual sample generation method to construct C-REC datasets. Specifically, we leverage a pre-trained language model such as BERT to modify the attribute words in the queries, obtaining the corresponding counterfactual samples. Furthermore, we propose a C-REC framework. We first adopt three encoders to extract image, text, and attribute features. Then, our dual-branch attentive fusion module fuses these cross-modal features along two branches with an attention mechanism. Finally, two prediction heads generate a bounding box and a counterfactual label, respectively. In addition, we incorporate contrastive learning with the generated counterfactual samples as negatives to enhance counterfactual perception. Extensive experiments show that our framework achieves promising performance on both the public REC datasets RefCOCO/+/g and our constructed C-REC datasets C-RefCOCO/+/g. The code and data are available at https://github.com/Glacier0012/CREC.
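To make the counterfactual sample generation step concrete, below is a minimal sketch of the idea of masking an attribute word in a query and letting a masked language model propose a different word, so the edited query no longer matches the image region. This is not the authors' released implementation (see the repository above for that); the HuggingFace `transformers` pipeline, the `bert-base-uncased` checkpoint, the toy `ATTRIBUTE_WORDS` set, and the `make_counterfactual` helper are all illustrative assumptions.

```python
# Hedged sketch: BERT-based attribute-word substitution for counterfactual queries.
# Assumes the HuggingFace `transformers` library; attribute vocabulary and the
# candidate-selection rule are placeholders, not the paper's exact procedure.
from transformers import pipeline

# Toy attribute vocabulary; the paper's fine-grained attribute categories
# (e.g. color, location, size words) would replace this placeholder set.
ATTRIBUTE_WORDS = {"red", "blue", "left", "right", "small", "large"}

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def make_counterfactual(query: str) -> str:
    """Swap one attribute word in the query for a BERT-proposed alternative."""
    tokens = query.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in ATTRIBUTE_WORDS:
            masked = tokens[:i] + [fill_mask.tokenizer.mask_token] + tokens[i + 1:]
            candidates = fill_mask(" ".join(masked))
            # Keep the highest-scoring candidate that differs from the original
            # word, so the edited query no longer describes the target object.
            for cand in candidates:
                if cand["token_str"].strip().lower() != tok.lower():
                    tokens[i] = cand["token_str"].strip()
                    return " ".join(tokens)
    return query  # no attribute word found; query left unchanged

print(make_counterfactual("the red cup on the left"))
# e.g. "the blue cup on the left" (actual output depends on the model)
```

In the framework described in the abstract, such edited queries would serve as counterfactual samples: negatives for contrastive learning and positives for the counterfactual-label prediction head.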

Related Material


@InProceedings{Yu_2024_CVPR,
  author    = {Yu, Zhihan and Li, Ruifan},
  title     = {Revisiting Counterfactual Problems in Referring Expression Comprehension},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {13438-13448}
}