Time-Contrastive Networks: Self-Supervised Learning From Multi-View Observation

Pierre Sermanet, Corey Lynch, Jasmine Hsu, Sergey Levine; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 14-15

Abstract


We propose a self-supervised approach for learning representations of relationships between humans and their environment, including object interactions, attributes, and body pose, entirely from unlabeled videos recorded from multiple viewpoints. We train our representation as an embedding with a triplet loss that opposes simultaneous frames from different viewpoints against temporally adjacent, visually similar frames. This opposition allows the model to disambiguate among the possible explanations for temporal changes in the world. We demonstrate that our model can correctly identify corresponding steps in complex object interactions, such as pouring, across different videos and different object instances. We also show what are, to the best of our knowledge, the first self-supervised results for end-to-end imitation learning of human motions with a real robot.
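
The training signal admits a short sketch. Below is a minimal, hypothetical PyTorch rendering of the multi-view time-contrastive triplet loss described above; the function name tcn_triplet_loss, the margin value, the squared-Euclidean distance, and the random placeholder embeddings are assumptions rather than the authors' implementation. In the paper, the inputs would come from a learned embedding network, with negatives drawn from frames temporally near the anchor in the same view.

import torch
import torch.nn.functional as F

def tcn_triplet_loss(anchor, positive, negative, margin=0.2):
    # anchor:   embeddings of frames from view 1 at time t       (B, D)
    # positive: embeddings of simultaneous frames from view 2    (B, D)
    # negative: embeddings of view-1 frames at nearby times --
    #           visually similar, but a different moment         (B, D)
    # NOTE: margin=0.2 is an assumed hyperparameter, not from the paper.
    d_pos = (anchor - positive).pow(2).sum(dim=1)  # squared distances
    d_neg = (anchor - negative).pow(2).sum(dim=1)
    # Attract simultaneous views; repel temporal neighbors.
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with random stand-ins for embedding-network outputs:
torch.manual_seed(0)
anchor   = torch.randn(32, 128)
positive = torch.randn(32, 128)
negative = torch.randn(32, 128)
print(tcn_triplet_loss(anchor, positive, negative).item())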

Related Material


BibTeX:
@InProceedings{Sermanet_2017_CVPR_Workshops,
author = {Sermanet, Pierre and Lynch, Corey and Hsu, Jasmine and Levine, Sergey},
title = {Time-Contrastive Networks: Self-Supervised Learning From Multi-View Observation},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {July},
year = {2017},
pages = {14-15}
}