Image-caption Difficulty for Efficient Weakly-supervised Object Detection from In-the-wild Data

Giacomo Nebbia, Adriana Kovashka; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2596-2605

Abstract


In recent years we have witnessed the collection of larger and larger multi-modal image-caption datasets: from hundreds of thousands of such pairs to hundreds of millions. Such datasets allow researchers to build powerful deep learning models, at the cost of requiring intensive computational resources. In this work, we ask: can we use such datasets efficiently without sacrificing performance? We tackle this problem by extracting difficulty scores from each image-caption sample and by using such scores to make training more effective and efficient. We compare two ways to use difficulty scores to influence training: filtering a representative subset of each dataset and ordering samples through curriculum learning. We analyze and compare difficulty scores extracted from a single modality---captions (i.e., caption length and number of object mentions) or images (i.e., region proposals' size and number)---or based on the alignment of image-caption pairs (i.e., CLIP and concreteness). We focus on Weakly-Supervised Object Detection, where image-level labels are extracted from captions. We discover that (1) combining filtering and curriculum learning can achieve large gains in performance, but not all methods are stable across experimental settings; (2) single-modality scores often outperform alignment-based ones; and (3) alignment scores show the largest gains when training time is limited.
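The two strategies compared in the abstract, filtering a subset by difficulty and ordering samples for curriculum learning, can be sketched minimally as follows. This is an illustrative sketch, not the paper's implementation: the scoring function shown (caption length, one of the single-modality scores mentioned above) and all function names are hypothetical placeholders.

```python
# Hypothetical sketch of the two difficulty-based training strategies:
# (a) filter a subset of the dataset, (b) order samples for curriculum
# learning. The difficulty score here is caption length, one of the
# single-modality scores discussed in the abstract; names are illustrative.

def caption_length_score(caption):
    # Single-modality difficulty score: longer captions treated as harder.
    return len(caption.split())

def filter_subset(samples, score_fn, keep_fraction=0.5):
    # Keep the easiest fraction of the dataset (lowest difficulty scores).
    ranked = sorted(samples, key=lambda s: score_fn(s["caption"]))
    return ranked[: max(1, int(len(ranked) * keep_fraction))]

def curriculum_order(samples, score_fn):
    # Present samples from easiest to hardest over the course of training.
    return sorted(samples, key=lambda s: score_fn(s["caption"]))

samples = [
    {"caption": "a dog"},
    {"caption": "a large brown dog running through tall grass near a lake"},
    {"caption": "two cats on a sofa"},
]

easy_half = filter_subset(samples, caption_length_score)
ordered = curriculum_order(samples, caption_length_score)
```

In this toy run, `filter_subset` keeps only the shortest-caption sample, while `curriculum_order` yields the full list sorted from the 2-word caption to the 11-word one; in the paper these strategies are also combined, which is where the largest gains are reported.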

Related Material


[pdf]
[bibtex]
@InProceedings{Nebbia_2024_CVPR,
  author    = {Nebbia, Giacomo and Kovashka, Adriana},
  title     = {Image-caption Difficulty for Efficient Weakly-supervised Object Detection from In-the-wild Data},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2024},
  pages     = {2596-2605}
}