Representation Learning From Videos In-the-Wild: An Object-Centric Approach

Rob Romijnders, Aravindh Mahendran, Michael Tschannen, Josip Djolonga, Marvin Ritter, Neil Houlsby, Mario Lucic; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2021, pp. 177-187

Abstract


We propose a method to learn image representations from uncurated videos. We combine a supervised loss from off-the-shelf object detectors with self-supervised losses that arise naturally from the video-shot-frame-object hierarchy present in each video. We report competitive results on the 19 transfer learning tasks of the Visual Task Adaptation Benchmark (VTAB) and on 8 out-of-distribution generalization tasks, and discuss the benefits and shortcomings of the proposed approach. In particular, it improves over the baseline on 18/19 few-shot learning tasks and 8/8 out-of-distribution generalization tasks. Finally, we perform several ablation studies and analyze the impact of the pretrained object detector on performance across this suite of tasks.

Related Material


[bibtex]
@InProceedings{Romijnders_2021_WACV,
    author    = {Romijnders, Rob and Mahendran, Aravindh and Tschannen, Michael and Djolonga, Josip and Ritter, Marvin and Houlsby, Neil and Lucic, Mario},
    title     = {Representation Learning From Videos In-the-Wild: An Object-Centric Approach},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2021},
    pages     = {177-187}
}