Learning Temporal Dynamics From Cycles in Narrated Video

Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 1480-1489

Abstract


Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community. We introduce a self-supervised approach to this problem that solves a multi-modal temporal cycle-consistency objective (MMCC) jointly in vision and language. This objective requires a model to learn modality-agnostic functions for predicting the future and the past that undo each other when composed. We hypothesize that a model trained on this objective will discover long-term temporal dynamics in video. We verify this hypothesis by using the resulting visual representations and predictive models as-is to solve a variety of downstream tasks. Our method outperforms state-of-the-art self-supervised video prediction methods on future action anticipation, temporal image ordering, and arrow-of-time classification tasks, without training on the target datasets or their labels.
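The core idea of the cycle objective, composing a learned forward-in-time predictor with a learned backward-in-time predictor and requiring the composition to return to its starting point, can be sketched in a few lines of PyTorch. Everything below (the module names, the 256-d shared embedding space, and the soft-attention retrieval over observed moments) is an illustrative assumption, not the paper's exact architecture or loss.

```python
# Minimal sketch of a temporal cycle-consistency objective in a shared
# vision/language embedding space. All names and dimensions are
# illustrative assumptions, not the authors' exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CyclePredictor(nn.Module):
    """Predicts an embedding one step forward or backward in time."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, z):
        return self.net(z)

def soft_nearest(query, memory, temperature=0.07):
    """Soft attention over a memory of observed moment embeddings.

    Differentiable stand-in for retrieving the observed moment closest
    to a predicted embedding. query: (B, D), memory: (M, D).
    """
    sims = query @ memory.t() / temperature   # (B, M) similarities
    attn = sims.softmax(dim=-1)
    return attn @ memory                      # (B, D) retrieved moments

def mmcc_loss(z_start, memory, fwd, bwd):
    """Cycle loss: hop forward in time, snap to an observed moment,
    hop backward, and require landing at the starting embedding."""
    z_future = soft_nearest(fwd(z_start), memory)
    z_back = bwd(z_future)
    return F.mse_loss(z_back, z_start)        # the cycle must close

# Toy usage with random tensors standing in for encoder outputs.
fwd, bwd = CyclePredictor(), CyclePredictor()
z_start = torch.randn(8, 256)   # e.g. frame embeddings at time t
memory = torch.randn(64, 256)   # all candidate moments in the video
loss = mmcc_loss(z_start, memory, fwd, bwd)
loss.backward()
```

In the paper's setting, the starting embeddings and the memory would come from vision and language encoders trained into the same space, so the same forward and backward predictors serve both modalities; here random tensors stand in for those embeddings.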

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Epstein_2021_ICCV,
    author    = {Epstein, Dave and Wu, Jiajun and Schmid, Cordelia and Sun, Chen},
    title     = {Learning Temporal Dynamics From Cycles in Narrated Video},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {1480-1489}
}