@InProceedings{Epstein_2021_ICCV,
  author    = {Epstein, Dave and Wu, Jiajun and Schmid, Cordelia and Sun, Chen},
  title     = {Learning Temporal Dynamics From Cycles in Narrated Video},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2021},
  pages     = {1480-1489}
}
Learning Temporal Dynamics From Cycles in Narrated Video
Abstract
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle consistency objective, MMCC, jointly in vision and language. This objective requires a model to learn modality-agnostic functions to predict the future and past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resultant visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on target datasets or their labels.
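The core idea — a forward (future) predictor and a backward (past) predictor trained so that their composition returns to the start — can be illustrated with a toy sketch. The following is a minimal, hypothetical NumPy example, not the paper's MMCC method: it uses random vectors as stand-ins for frame embeddings and simple linear maps as stand-ins for the learned predictors, and minimizes a cycle-consistency reconstruction loss by hand-derived gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "frame embeddings" (illustrative stand-ins for visual/text features).
D = 8
X = rng.normal(size=(64, D))

# Forward ("predict the future") and backward ("predict the past") maps,
# here plain linear functions rather than the paper's learned predictors.
F = rng.normal(scale=0.1, size=(D, D))
B = rng.normal(scale=0.1, size=(D, D))

def cycle_loss(X, F, B):
    # Composing future and past prediction should undo itself:
    # g(f(x)) should land back on x.
    return np.mean((X @ F @ B - X) ** 2)

lr = 0.05
losses = [cycle_loss(X, F, B)]
for _ in range(200):
    R = X @ F @ B - X            # residual of the cycle, shape (N, D)
    n = X.shape[0] * D
    gF = 2.0 * (X.T @ R @ B.T) / n   # dL/dF
    gB = 2.0 * (F.T @ X.T @ R) / n   # dL/dB
    F -= lr * gF
    B -= lr * gB
    losses.append(cycle_loss(X, F, B))

print(f"cycle loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Driving this loss to zero only requires the two maps to invert each other, which is why the objective is self-supervised: no labels about the future or past are needed, only the constraint that the composed predictions close the cycle.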