Exploring Phrase Grounding Without Training: Contextualisation and Extension to Text-Based Image Retrieval

Letitia Parcalabescu, Anette Frank; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 962-963

Abstract

Grounding phrases in images links the visual and textual modalities and is useful for many image understanding and multimodal tasks. All known models rely heavily on annotated data and complex trainable systems to perform phrase grounding -- except for a recent work [38] that proposes a system requiring neither training nor aligned data, yet is able to compete with (weakly) supervised systems on popular phrase grounding datasets. We explore and expand the upper bound of such a system by contextualising both the image and the language representations with structured representations. We show that our extensions benefit the model and establish a harder but fairer baseline for (weakly) supervised models. We also perform a stress test to assess how far such a system can be taken towards a text-based image retrieval system that requires neither training nor annotated data. We show that such models have a difficult start and a long way to go, and that more research is needed.
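
To make the training-free idea concrete, here is a minimal sketch of phrase grounding by embedding similarity: a pretrained object detector supplies labelled boxes, and a phrase is grounded to the box whose label embedding is closest to the phrase embedding. All names here (Detection, embed_text, the toy vectors) are illustrative assumptions for this sketch, not the paper's actual code or API; a real system would reuse pretrained word embeddings such as word2vec or GloVe.

import numpy as np
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple   # (x1, y1, x2, y2) in pixels
    label: str   # class label predicted by a pretrained object detector

# Toy word vectors standing in for real pretrained embeddings.
TOY_VECTORS = {
    "dog":    np.array([0.9, 0.1, 0.0]),
    "puppy":  np.array([0.8, 0.2, 0.1]),
    "car":    np.array([0.0, 0.9, 0.3]),
    "person": np.array([0.1, 0.2, 0.9]),
}

def embed_text(text):
    """Average the vectors of known words; zero vector if none are known."""
    vecs = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    return np.mean(vecs, axis=0) if vecs else np.zeros(3)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def ground_phrase(phrase, detections):
    """Return the detection whose label is semantically closest to the phrase.
    No training and no aligned data: only pretrained components are reused."""
    p = embed_text(phrase)
    return max(detections, key=lambda d: cosine(p, embed_text(d.label)))

detections = [Detection((10, 10, 80, 90), "dog"),
              Detection((100, 20, 180, 150), "person")]
print(ground_phrase("a small puppy", detections).box)  # (10, 10, 80, 90)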

Related Material

[pdf] [bibtex]
@InProceedings{Parcalabescu_2020_CVPR_Workshops,
author = {Parcalabescu, Letitia and Frank, Anette},
title = {Exploring Phrase Grounding Without Training: Contextualisation and Extension to Text-Based Image Retrieval},
booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2020}
}