- [pdf] [supp]
Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models
Human spatial attention conveys information about theregions of visual scenes that are important for perform-ing visual tasks. Prior work has shown that the informa-tion about human attention can be leveraged to benefit var-ious supervised vision tasks. Might providing this weakform of supervision be useful for self-supervised represen-tation learning? Addressing this question requires collect-ing large datasets with human attention labels. Yet, col-lecting such large scale data is very expensive. To addressthis challenge, we construct an auxiliary teacher model topredict human attention, trained on a relatively small la-beled dataset. This teacher model allows us to generate im-age (pseudo) attention labels for ImageNet. We then traina model with a primary contrastive objective; to this stan-dard configuration, we add a simple output head trained topredict the attentional map for each image, guided by thepseudo labels from teacher model. We measure the qual-ity of learned representations by evaluating classificationperformance from the frozen learned embeddings as wellas performance on image retrieval tasks. We find that thespatial-attention maps predicted from the contrastive modeltrained with teacher guidance aligns better with human at-tention compared to vanilla contrastive models. Moreover,we find that our approach improves classification accuracyand robustness of the contrastive models on ImageNet andImageNet-C. Further, we find that model representationsbecome more useful for image retrieval task as measuredby precision-recall performance on ImageNet, ImageNet-C,CIFAR10, and CIFAR10-C datasets.