Leveraging Task-Specific Pre-Training To Reason Across Images and Videos

Sadhu, Arka; Nevatia, Ram

Arka Sadhu, Ram Nevatia; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 5794-5804

Abstract

We explore the Reasoning Across Images and Video (RAIV) task, which requires models to reason over a pair of visual inputs comprising various combinations of images and/or videos. Previous work in this area has been limited to image pairs, focusing primarily on the existence and/or cardinality of objects. To address this, we leverage existing datasets with rich annotations to generate semantically meaningful queries about actions, objects, and their relationships. We introduce new datasets of visually similar inputs that require reasoning over images, across images and videos, or across videos. Recognizing the distinct nature of RAIV compared to existing pre-training objectives, which work on single image-text pairs, we explore task-specific pre-training, wherein a pre-trained model is trained on an objective similar to downstream tasks without utilizing fine-tuning datasets. Experiments with several state-of-the-art pre-trained image-language models reveal that task-specific pre-training significantly enhances performance on downstream datasets, even in the absence of additional pre-training data. We provide further ablative studies to guide future work.

Related Material

@InProceedings{Sadhu_2024_WACV,
    author    = {Sadhu, Arka and Nevatia, Ram},
    title     = {Leveraging Task-Specific Pre-Training To Reason Across Images and Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {5794-5804}
}