GraDual: Graph-Based Dual-Modal Representation for Image-Text Matching

Siqu Long, Soyeon Caren Han, Xiaojun Wan, Josiah Poon; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022, pp. 3459-3468


Image-text retrieval task is a challenging task. It aims to measure the visual-semantic correspondence between an image and a text caption. This is tough mainly because the image lacks semantic context information as in its corresponding text caption, and the text representation is very limited to fully describe the details of an image. In this paper, we introduce Graph-based Dual-modal Representations (GraDual), including Vision-Integrated Text Embedding (VITE) and Context-Integrated Visual Embedding (CIVE), for image-text retrieval. The GraDual improves the coverage of each modality by exploiting textual context semantics for the image representation, and using visual features as a guidance for the text representation. To be specific, we design: 1) a dual-modal graph representation mechanism to solve the lack of coverage issue for each modality. 2) an intermediate graph embedding integration strategy to enhance the important pattern across other modality global features. 3) a dual-modal driven cross-modal matching network to generate a filtered representation of another modality. Extensive experiments on two benchmark datasets, MS-COCO and Flickr30K, demonstrates the superiority of the proposed GraDual in comparison to state-of-the-art methods.

Related Material

[pdf] [supp]
@InProceedings{Long_2022_WACV, author = {Long, Siqu and Han, Soyeon Caren and Wan, Xiaojun and Poon, Josiah}, title = {GraDual: Graph-Based Dual-Modal Representation for Image-Text Matching}, booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, month = {January}, year = {2022}, pages = {3459-3468} }