Doubling Down: Sparse Grounding With an Additional, Almost-Matching Caption for Detection-Oriented Multimodal Pretraining
A common paradigm in deep learning applications for computer vision is self-supervised pretraining followed by supervised fine-tuning on a target task. In the self-supervision step, a model is trained in a supervised fashion, but the source of supervision needs to be implicitly defined by the data. Image-caption alignment is often used as such a source of implicit supervision in multimodal pretraining, and grounding (i.e., matching word tokens with visual tokens) is one way to exploit it. We introduce a strategy to take advantage of an underexplored structure in image-caption datasets: the relationship between captions matched with different images but mentioning the same objects. Given an image-caption pair, we find an additional caption that mentions one of the objects the first caption mentions, and we impose a sparse grounding between the image and the second caption so that only a few word tokens are grounded with the image. Our goal is to learn a better feature representation for the objects mentioned by both captions, encouraging grounding between the additional caption and the image to focus on the common objects only. We report superior grounding performance when comparing our approach with a previously-published pretraining strategy, and we show the benefit of our proposed double-caption grounding on two downstream detection tasks: supervised detection and open-vocabulary detection.