What Should Be Equivariant in Self-Supervised Learning
Self-supervised learning (SSL) aims to learn feature representations without human-annotated data. Existing methods approach this goal by encouraging the feature representations to be invariant under a set of task-irrelevant transformations and distortions defined a priori. However, multiple studies have shown that this assumption often limits the expressive power of the representations, and models perform poorly when downstream tasks violate it. For example, being invariant to rotations prevents features from retaining enough information to estimate object rotation angles. This suggests that additional manual work and domain knowledge are required for selecting augmentation types during SSL. In this work, we relax the transformation-invariance assumption by introducing an SSL framework that encourages the feature representations to preserve the order of transformation scale in embedding space for some transformations while maintaining invariance to others. This allows the learned feature representations to retain information about task-relevant transformations. In addition, this framework gives rise to a handy mechanism for determining the augmentation types to which the feature representations should be invariant or equivariant during SSL. We demonstrate the effectiveness of our method on various datasets, including Fruits 360, Caltech-UCSD Birds 200, and the Blood Cells dataset.
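To make the idea of "preserving the order of transformation scale in embedding space" concrete, here is a minimal sketch of what such an objective could look like. The function names, the hinge/margin formulation, and the use of Euclidean distance are illustrative assumptions for exposition, not the paper's actual loss:

```python
import numpy as np

def invariance_loss(z_a, z_b):
    # Standard SSL invariance term: pull two augmented views together
    # by penalizing their mean squared embedding distance.
    return float(np.mean((np.asarray(z_a) - np.asarray(z_b)) ** 2))

def scale_order_loss(z_ref, z_views, margin=0.1):
    """Hypothetical order-preserving (equivariance) term.

    z_views[i] is the embedding of the input transformed at scale s_i,
    listed in increasing scale order (s_0 < s_1 < ...). The loss penalizes
    pairs where a larger transformation scale does NOT yield a larger
    embedding distance from the untransformed reference z_ref.
    """
    dists = [float(np.linalg.norm(np.asarray(z) - np.asarray(z_ref)))
             for z in z_views]
    loss, n_pairs = 0.0, 0
    for i in range(len(dists)):
        for j in range(i + 1, len(dists)):
            # Hinge penalty: we want dists[j] >= dists[i] + margin,
            # since view j was transformed at a larger scale than view i.
            loss += max(0.0, dists[i] - dists[j] + margin)
            n_pairs += 1
    return loss / max(n_pairs, 1)
```

Under this sketch, embeddings whose distances from the reference grow monotonically with transformation scale incur zero order loss, while any inversion of the ordering is penalized; a full objective would combine this term (for task-relevant transformations) with the invariance term (for task-irrelevant ones).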